1 Logging into the Xanadu cluster

To access the cluster you need to login with ssh (secure shell):

ssh <user_name>@xanadu-submit-ext.cam.uchc.edu 

# you user name looks like this:
ssh ssun@xanadu-submit-ext.cam.uchc.edu 

2 Tranferring data to and from the cluster

To move files in between computer you can login with sftp use scp (secure copy):

2.1 sftp:

ftp stands for “File Transfer Protocol”, sftp is " Secure File Transfer Protocol". In other words, with sftp, a useraccount and password are required.


sftp  <your_username>@<host_name>

For the Xanadu cluster, there is a special partition for transferring data:


sftp ssun@transfer.cam.uchc.edu

1. You can then navigate to the directory where you want to take files from.
2. put and get can be used to move files from or to your computer, respectively

2.2 scp

scp can be used without logging in provided you know the exact location where your file of interest is or will go. We will primarily use sftp in this course.

# for copying TO the server
scp -r <path_to_directory> <your_username>@transfer.cam.uchc.edu:~/path/to/target/folder

You should be prompted for a password. If not, the transfer probably failed.

# for copying FROM the server
scp -r <your_username>@<host_name>:<target_directory> 

2.2.1 How do we know if transfer was complete?

There’s a program called md5 (mac) or md5sum (linux) that can help us with this. It returns a compact digital fingerprint for each file. Any change to the file will result in a different fingerprint.

on a mac:

md5 Macrophage_15min_rep1_PE1.fastq.gz

on Linux:

md5sum Macrophage_15min_rep1_PE1.fastq.gz

2.3 Exercise 1: Inspecting, retrieving, and checking files from server

1 Log onto the server using ssh
2 Navigate to your folder in /labs/Guertin/siyu
3 View the checksum string for the file Macrophage_15min_rep1_PE1.fastq.gz.
4 Logout and return to your home directory or open a new terminal window (command-t)
5 Transfer the file to your computer using sftp
6 Confirm that the transfer was complete

3 Running interactive sessions and batch scripts on Xanadu

Xanadu uses the SLURM (Simple Linux Utility for Resource Management) language to manage workloads.
Documentation:
https://slurm.schedmd.com/

3.1 Interactive session

3.1.1 Important: All actions that require CPU or memory should be run from an interactive session to avoid jamming the head node.

The head node is like a lobby where everyone enters a space, but actions should not be taken here. You should request a ‘room’ or interactive session in which to conduct your business.

To see if you are on the head node you can use the hostname command. If the hostname ends in .cam.uchc.edu, then you are on the head node. If the hostname is xanadu-*, then you are in an interactive session. If you are not working in an interactive session, then exit so that resources are available to other users.


hostname
#simplest way to start a session
srun --pty -p general --qos=general  bash
hostname
exit
hostname

#request 4 cores and 16 Gb of RAM (this is what I typically request)
srun --pty --qos=general -p general -c 4 --mem=16G bash

hostname
exit

To confirm that you are indeed in an interactive session, try the ‘hostname’ command and it should return xanadu-<n>

3.2 Batch scripts

Sbatch (Slurm Batch) scripts need to start with a header that contains important information for running and reporting the output of your script. sbatch-template.sh

#!/bin/bash

#SBATCH --job-name=hello.sh     # name for job
#SBATCH -N 1                    # number of nodes 
#SBATCH -n 1                    # number of jobs / tasks 
#SBATCH -c 1                    # number of cores 
#SBATCH -p general           # SLURM partition 
#SBATCH --qos=general        # SLURM Quality of service 
#SBATCH --mem=1G                # RAM (memory) requested 
#SBATCH --mail-type=ALL 
#SBATCH --mail-user=ssun@uchc.edu
#SBATCH -o scriptname.sh_%j.out
#SBATCH -e scriptname.sh_%j.err

hostname # have this at beginning of every script for  troubleshooting

<your script>

here is a random script:

#! /bin/sh
list="1 0 3 6"

for x in $list
  do
# sleep for a few minutes so it is not immediately completed!  
    sleep 30s
    echo Welcome to Batch submissions
    echo $x
  done

echo $list

Here is one way to save it as a file:

touch welcome_batch.sh
nano welcome_batch.sh

# copy and paste the batch header and script into the nano text editor 
# Ctrl + O (save--also use the text prompt to rename if needed)
# Enter
# Ctrl + X (exit)

We can add this script to our batch script and rename it welcome_batch.sh.
To run the script we use:

sbatch welcome_batch.sh

To check the status there are several options:

squeue #check the status of the entire queue

squeue -u <usrID> # check the status of all jobs by a user

squeue -j <jobID>  # Check status of specific job

If something is wrong and you want to kill a job:

scancel <jobID>

You can also check the status of the server to see how busy it is with sinfo

sinfo --format="%10P  %6t %15O %15C %15F %10m %10e %15n %30E %10u" 

3.3 In class exercise 1: running and altering batch scripts

1. Copy the sbatch-template.sh script from above into your home directory.
2 Add the welcome_batch.sh script to your batch script, and adjust the other user input (email, jobID, etc) appropriately.
3. Save and run the script. Where did the output go?
4. create a sbatch-output folder in your home directory and alter the sbatch script such that the output and errors go to this folder.
5. intentionally create an error in your script (e.g. add double quotes but don’t close them), and run the script again. Now check the error file to see an example of error outputs.
6. Use the sbatch script to run another script from your scripts folder by calling it directly. i.e. instead of having the welcome-batch script commands in the sbatch file, create another script and run that with the batch script.

4 Server operations: Using module to temporarily load functions:

We’re going to use the cutadapt and bowtie2 amongst other software. These programs are on the server, however, we need to load them into our session for use. On our own machine we made sure the programs were in our $PATH. On the server we can use the module to load and have access to specific functions during your session without altering your $PATH. This provides flexibility and specificity in running software versions.
First, view which modules are available:

module avail  

We can check specifically for sratoolkit:

module avail cutadapt

Once you find the one you’re looking for:

module load <software>  

This is convenient because you can add module functions to your shell scripts to be sure that you are using the correct software versions.

AND
You don’t have to waste time adding lots of programs to your $PATH

It is flexible because it allows you to even switch between version of software on the fly. For instance:

# load software
module load <software_v.2>   

# do something
some commands, etc.
#unload software and load another version
module unload <software_v.2>
module load <software_v.1>

# do something else
more commands, etc.

4.1 Server operations: Running jobs in the background

  • One problem with working on the server is that if you start a command and it takes a while, you’re stuck.
  • To get around this you can run the operation in the background using the ‘&’ symbol.
  • This runs the program in the background, allowing you to continue using your command prompt while the program is executed.
  • To start a command in the background just add & to the end of the command.
  • You should get a process ID number: a unique identifier that allows you to track that operation.

To see what processes are running you can use several commands:

  • top: continuous monitoring of activity
  • ps or ‘process’: shows the activity under your user and only shows that activity at that instance (snapshot).
  • qstat: when you’re on the server.
  • squeue: (covered previously) can show jobs by, user, ID, or node.

4.2 Killing jobs

To stop runaway processes or ones that are taking too long in the background, you cannot use the normal Control-c. You can, however, scancel the process using the process ID number.

# when on the Xanadu server
scancel <process_ID>

You can also run jobs in the background on your own machine. In this case, you can use kill to cancel jobs.

kill <process_ID>

You can review other ways to view activities on the server and kill processes here: http://bioinformatics.uconn.edu/unix-basics/

5 Text editors

When you are writing a script, it is usually done locally (not on the server) and either copied into a server text editor such as emacs or nano or transfered with sftp. Certain text editors are designed for scripting and can recognize code. Importantly, they do not embed the document or fonts with hidden characteristics that can cause problems when running programs on you computer There are three features that you should look for in an editor:

1. language specific highlighting   2. line number display   3. version control

MAC USERS: Download BBedit here: http://www.barebones.com/products/bbedit/download.html?ref=tw_alert: http://www.barebones.com/products/bbedit/download.html?ref=tw_alert and then install it:

Open your text editor: gedit on Linux or BBEdit or textEdit on Mac.

Note: You can also use nano or other command line editors such as emacs or vim. I can advise in nano or emacs, but limited experience with vim.

A quick primer for nano is:

touch filename.sh

nano filename.sh

# write some code

ctrl-O save or use backspace to save as another name

make edits

ctrl-O to save

ctrl-X to exit

The first line is the Shebang line:

#! /bin/sh

or sometimes

#! /usr/bin/sh

to find out where your shell is type:

which sh

Let’s try a simple script:

ls -la
echo "Above are the directory listings for the following folder"
pwd

Create a new folder in your folder called ‘scripts’
Save your script as ls.sh
Go to the directory where ls.sh is and try to execute it:

./ls.sh

In order to run this script we need to give the proper permissions. To see permissions for files, type:

ls -la

The columns are:

1. permissions
2. owner
3. group
4. size
5. last modifiation
6. name  

In permissions: ‘d’=directory, ‘-’ = file, ‘r’ = read, ‘w’ = write, ‘x’ = execute.
Three sets of permissions are shown: User (owner), Group, Other users.

To give permission to execute type:

chmod +x ls.sh

Now use ls -la to view permissions and then try to execute.

6 Aligning ATAC-seq data

6.1 Obtain the reference genomes

First we need to download the three genomes we will align to: the chrM of mm39, hg38, and mm39. I always download from UCSC, Google “mm39 downloads ucsc” or the alike to get here: https://hgdownload.soe.ucsc.edu/goldenPath/mm39/bigZips/

In order to get the chrM file you need to get the mm39.chromFa.tar.gz file and unzip/untar it. Then make a directory for your mm39 references, move chrM.fa into this directory, then delete everything we don’t need. Navigate to mm39 and download the full genome file. Then do the same for hg38—see if you can get this link from poking around UCSC.


wget https://hgdownload.soe.ucsc.edu/goldenPath/mm39/bigZips/mm39.chromFa.tar.gz

tar -xzf mm39.chromFa.tar.gz

mkdir mm39

mv chrM.fa mm39 

rm chr*.fa
rm mm39.chromFa.tar.gz

cd mm39
ls
wget https://hgdownload.soe.ucsc.edu/goldenPath/mm39/bigZips/mm39.fa.gz

cd ..

mkdir hg38

cd hg38

wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/latest/hg38.fa.gz

6.2 Index reference genomes

Bowtie 2 is an efficient tool for aligning sequencing reads to long reference sequences. The first task is to build the genome index with bowtie2 http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#the-bowtie2-build-indexer. This only has to be performed once per genome, but we are aligning to three separate genomes in this workflow.


module load bowtie2

gunzip hg38.fa.gz 

bowtie2-build hg38.fa hg38

cd ../mm39
bowtie2-build chrM.fa chrM_mm39
gunzip mm39.fa.gz
bowtie2-build mm39.fa mm39

6.3 Clip off adapter sequences

As Illumina reads become longer, the more likely it becomes that a read sequences into the adapter used to generate the library. A first step in most molecular genomics analyses is adapter clipping. There are many tools that accomplish this task, but we typically use cutadapt.

As an aside, there are more efficient ways to analyze data than writing a loop to go through the files in series. However, this is what I typically do in vignettes to make things simple. The quickest way I know of executing all these looped commands on all the files in parallel is to write a loop to go through all the files, pass the variable name $name as an argument to a script that generates a new sbatch script for each file set (PE1 and PE2), and execute the newly created sbatch script within the loop. I can show you an example if you’d like.

# in the home directory, create the ".bashrc" file
# open the file and write:
export PATH="$HOME/bin:$PATH"
export PATH=$PATH:~/excutive/file/directory
# each time when open a new Xanadu connection, activate the .bashrc before the commands:
source ~/.bashrc

Another aside: these are big files. If you want to make sure the scripts and commands are working, I recommend heading the first 400,000 lines of each PE file (zcat Macrophage_rep1_PE1.fastq.gz | head -4000000 > test_PE1.fastq), zipping, then running the commands only on these files so they finish quickly.

/home/FCAM/mguertin/software/fastq_pair

module load cutadapt

for fq in *PE1.fastq.gz
do
    name=$(echo $fq | awk -F"_PE1.fastq.gz" '{print $1}')
    echo $name
    gunzip $fq
    gunzip ${name}_PE2.fastq.gz
    cutadapt -j 2 -m 10 -O 1 -a CTGTCTCTTATACACATCT ${name}_PE1.fastq -o ${name}_PE1_no_adapt.fastq
    cutadapt -j 2 -m 10 -O 1 -a CTGTCTCTTATACACATCT ${name}_PE2.fastq -o ${name}_PE2_no_adapt.fastq
    gzip ${name}_PE1.fastq
    gzip ${name}_PE2.fastq
done

6.4 Re-pairing paired end files

cutadapt will discard reads with very short inserts, so the next step involves repairing the PE1 and PE2 files with fastq_pair. You can either download and install this, or you should be able to copy or execute the version I download and compiled onto Xanadu: /home/FCAM/mguertin/software/fastq_pair. I recommend putting it in a software directory and adding this directory to your $PATH—please let me know if you need a tutorial on the $PATH variable and your .bash_profile and .bashrc files. Next we will zip all the output files and move on to alignment with bowtie2.

for fq in *PE1.fastq.gz
do
    name=$(echo $fq | awk -F"_PE1.fastq.gz" '{print $1}')
    echo $name
    fastq_pair ${name}_PE1_no_adapt.fastq ${name}_PE2_no_adapt.fastq
    gzip ${name}_PE1_no_adapt.fastq
    gzip ${name}_PE2_no_adapt.fastq
    gzip ${name}_PE1_no_adapt.fastq.paired.fq
    gzip ${name}_PE2_no_adapt.fastq.paired.fq
done

6.5 Sequential alignments

The alignment steps include aligning to the mitochondrial chromosome, taking everything that does not align, aligning that to hg38, taking everything that does not align, then aligning that to mm39. There are some housekeeping commands included below, like pairing, zipping, and removing duplicate PE reads (i.e. each member of a pair align to the exact positions as another pair).

the -p $ncore option says to use 6 cores, but you are welcome to generate an sbatch script and ask for 30+ cores if you want the process to go quicker. A lot of the commands are hard coded, but it is better practice to initialize all the variable names in the header and pass the variables as arguments.

module load bowtie2
module load samtools
module load bedtools 

ncore=6

for fq in *PE1.fastq.gz
do
    name=$(echo $fq | awk -F"_PE1.fastq.gz" '{print $1}')
    echo $name
# align the chrM first, convert to BAM file and remove duplicates    
    bowtie2 -p $ncore --maxins 800 -x mm39/chrM_mm39 -1 ${name}_PE1_no_adapt.fastq.paired.fq.gz -2 ${name}_PE2_no_adapt.fastq.paired.fq.gz | samtools view -bS - | samtools sort -n - | samtools fixmate -m - - | samtools sort - | samtools markdup -r - $name.bam
    samtools index $name.bam
# extract all reads that do not align and generate FASTQs   
    samtools view -b $name.bam '*' | samtools sort -n - | bamToFastq -i - -fq ${name}_PE1.chrM_mm39.fastq -fq2 ${name}_PE2.chrM_mm39.fastq
    fastq_pair ${name}_PE1.chrM_mm39.fastq ${name}_PE2.chrM_mm39.fastq
    gzip ${name}_PE1.chrM_mm39.fastq.paired.fq
    gzip ${name}_PE2.chrM_mm39.fastq.paired.fq
    gzip ${name}_PE1.chrM_mm39.fastq
    gzip ${name}_PE2.chrM_mm39.fastq
# align to hg38 to get rid of the Jurkat reads    
    bowtie2 -p $ncore --maxins 800 -x hg38/hg38 -1 ${name}_PE1.chrM_mm39.fastq.paired.fq.gz -2 ${name}_PE2.chrM_mm39.fastq.paired.fq.gz | samtools view -bS - | samtools sort -n - | samtools fixmate -m - - | samtools sort - | samtools markdup -r - $name.hg38.bam
    samtools index $name.hg38.bam
# makes FASTQs out of everyhting that does not align    
    samtools view -b $name.hg38.bam '*' | samtools sort -n - | bamToFastq -i - -fq ${name}_PE1.hg38.fastq -fq2 ${name}_PE2.hg38.fastq
    fastq_pair ${name}_PE1.hg38.fastq ${name}_PE2.hg38.fastq
    gzip ${name}_PE1.hg38.fastq.paired.fq
    gzip ${name}_PE2.hg38.fastq.paired.fq
    gzip ${name}_PE1.hg38.fastq
    gzip ${name}_PE2.hg38.fastq
# Finally align to the mm39 genome    
    bowtie2 -p $ncore --maxins 800 -x mm39/mm39 -1 ${name}_PE1.hg38.fastq.paired.fq.gz -2 ${name}_PE2.hg38.fastq.paired.fq.gz | samtools view -bS - | samtools sort -n - | samtools fixmate -m - - | samtools sort - | samtools markdup -r - $name.mm39.bam
done

6.6 Swarming the cluster with individual jobs for each file/file set

An alternative and more efficient way to execute these commands, other than the serial loop above, is to automate the process of making a sbatch script and execute the sbatch scripts as resources on the cluster become available.

First we can generate and open a template file:

touch sbatch_fill_in_bowtie2_atac.sh
nano sbatch_fill_in_bowtie2_atac.sh

Next you can copy your template script with a string that you want replaced with a passed variable.


#!/bin/bash

#SBATCH --job-name=hello.sh     # name for job
#SBATCH -N 1                    # number of nodes 
#SBATCH -n 1                    # number of jobs / tasks 
#SBATCH -c 16                    # number of cores 
#SBATCH -p general           # SLURM partition 
#SBATCH --qos=general        # SLURM Quality of service 
#SBATCH --mem=16G                # RAM (memory) requested 
#SBATCH --mail-type=ALL 
#SBATCH --mail-user=ssun@uchc.edu
#SBATCH -o scriptname.sh_%j.out
#SBATCH -e scriptname.sh_%j.err

hostname 

fq=XXXXXXX
ncore=16

module load bowtie2
module load samtools
module load bedtools 

name=$(echo $fq | awk -F"_PE1.fastq.gz" '{print $1}')
echo $name
# align the chrM first, convert to BAM file and remove duplicates    
bowtie2 -p $ncore --maxins 800 -x mm39/chrM_mm39 -1 ${name}_PE1_no_adapt.fastq.paired.fq.gz -2 ${name}_PE2_no_adapt.fastq.paired.fq.gz | samtools view -bS - | samtools sort -n - | samtools fixmate -m - - | samtools sort - | samtools markdup -r - $name.bam
samtools index $name.bam
# extract all reads that do not align and generate FASTQs   
samtools view -b $name.bam '*' | samtools sort -n - | bamToFastq -i - -fq ${name}_PE1.chrM_mm39.fastq -fq2 ${name}_PE2.chrM_mm39.fastq
fastq_pair ${name}_PE1.chrM_mm39.fastq ${name}_PE2.chrM_mm39.fastq
gzip ${name}_PE1.chrM_mm39.fastq.paired.fq
gzip ${name}_PE2.chrM_mm39.fastq.paired.fq
gzip ${name}_PE1.chrM_mm39.fastq
gzip ${name}_PE2.chrM_mm39.fastq
# align to hg38 to get rid of the Jurkat reads    
bowtie2 -p $ncore --maxins 800 -x hg38/hg38 -1 ${name}_PE1.chrM_mm39.fastq.paired.fq.gz -2 ${name}_PE2.chrM_mm39.fastq.paired.fq.gz | samtools view -bS - | samtools sort -n - | samtools fixmate -m - - | samtools sort - | samtools markdup -r - $name.hg38.bam
samtools index $name.hg38.bam
# makes FASTQs out of everyhting that does not align    
samtools view -b $name.hg38.bam '*' | samtools sort -n - | bamToFastq -i - -fq ${name}_PE1.hg38.fastq -fq2 ${name}_PE2.hg38.fastq
fastq_pair ${name}_PE1.hg38.fastq ${name}_PE2.hg38.fastq
gzip ${name}_PE1.hg38.fastq.paired.fq
gzip ${name}_PE2.hg38.fastq.paired.fq
gzip ${name}_PE1.hg38.fastq
gzip ${name}_PE2.hg38.fastq
# Finally align to the mm39 genome    
bowtie2 -p $ncore --maxins 800 -x mm39/mm39 -1 ${name}_PE1.hg38.fastq.paired.fq.gz -2 ${name}_PE2.hg38.fastq.paired.fq.gz | samtools view -bS - | samtools sort -n - | samtools fixmate -m - - | samtools sort - | samtools markdup -r - $name.mm39.bam

Now you can run an interactive loop that dynamically updates the file name and submits the job.

file=sbatch_fill_in_bowtie2_atac.sh

for i in *PE1.fastq.gz
do
    nm=$(echo $i | rev | cut -f 1 -d '/' | rev | cut -f 1 -d '.')
    fq=$(echo $i | rev | cut -f 1 -d '/' | rev)
    echo $nm
    echo $fq
    sed -e "s/XXXXXXX/${fq}/g" "$file" > ${nm}_sbatch_fill_in_bowtie2_atac.sh
    sbatch ${nm}_sbatch_fill_in_bowtie2_atac.sh
done

The few annoying things that can be improved is that the fastq.gz files are currently in the working directory because passing a variable with the / character included as the string into sed causes issues. These issues can most easily be resolved by changing the sbatch_fill_in_bowtie2_atac.sh script to have directories coded in the header as variables.

One side note: leave enough space in the home directory for this step, un-zipped files are large and can take up to 600G space. use command du -sh /your/directory/ to check the space usage.

7 ATAC-seq peak calling

The next code chunk calls peaks and generates a 200 base window around the peak summit for de novo motif analysis. In the case of error message occurring: xanadu OSError: [Errno 28] No space left on device, we can set the TMPDIR environment variable (Linux, UNIX) by making a temp directory as:

mkdir -p ${HOME}/temp

then export this new temp directory as the new location for the temporary directory.


module load macs3
module load bedtools 
# export temp dir
export TMPDIR="${HOME}/temp" 

wget https://hgdownload.soe.ucsc.edu/goldenPath/mm39/bigZips/mm39.chrom.sizes
name=macrophage
atacFiles=$(ls *.mm39.bam)

# call peaks
macs3 callpeak -t $atacFiles -f BAMPE -g mm -n $name -B --trackline -q 0.01 --outdir ${name}_peaks 2>&1 | tee -a Macrophage_peaks_log.txt

# remove header (start at line 2 of the file) (one edition: use tail -n +2 instead of tail +2)
tail -n +2 ${name}_peaks/${name}_summits.bed > ${name}_summits2.bed

# make a 200bp window around the summit
slopBed -l 99 -r 100 -i ${name}_summits2.bed -g mm39.chrom.sizes > ${name}_summit_window.bed

# sort
sortBed -i ${name}_summit_window.bed > ${name}_summit_window_sorted.bed

# merge
mergeBed -i ${name}_summit_window_sorted.bed > ${name}_summit_window_merged.bed

8 Converting BAM to bigWIg for counting

On Xanadu; this requires installation of program seqOutBias

module load genometools/1.5.10 # required for running seqOutBias

for i in *.mm39.bam
do
    name=$(echo $i | awk -F".mm39.bam" '{print $1}')
    echo $name
    seqOutBias mm39.fa $i --no-scale --bw=${name}.bigWig --shift-counts --skip-bed
done

9 DESeq2 counting in R

There are many versions of R available on Xanadu, but version R/4.1.2 has several of the packages we need for class preinstalled. After loading the module, typing R will take you from a command line shell that is interpreted by bash to a command line interpreted through the R language.

srun --pty --qos=general -p general -c 2 --mem=16G bash

module load R/4.1.2

R

I exclusively run R interactively and I usually run it locally, so you are welcome to take the files off the server at this point and install these libraries locally. Just make sure to back up the data if you work locally:

# load library that are already on Xanadu:
library(DESeq2)
library(lattice)
library(bigWig)

# functions for DESeq2
get.raw.counts.interval <- function(df, path.to.bigWig, file.prefix = 'M') {
    df = df[,1:3]
    vec.names = c()
    inten.df=data.frame(matrix(ncol = 0, nrow = nrow(df)))
    for (mod.bigWig in Sys.glob(file.path(path.to.bigWig,
                                          paste(file.prefix, "*.bigWig", sep ='')))) {
        factor.name = strsplit(strsplit(mod.bigWig, "/")[[1]][length(strsplit(mod.bigWig, "/")[[1]])], '\\.')[[1]][1]
        print(factor.name)
        vec.names = c(vec.names, factor.name)
        loaded.bw = load.bigWig(mod.bigWig)
        mod.inten = bed.region.bpQuery.bigWig(loaded.bw, df)
        inten.df = cbind(inten.df, mod.inten)
    }
    colnames(inten.df) = vec.names
    r.names = paste(df[,1], ':', df[,2], '-', df[,3], sep='')
    row.names(inten.df) = r.names
    return(inten.df)
}

run.deseq.list <- function(mat, untreatedreps = 3, treatedreps = 4) {
    sample.conditions = factor(c(rep("untreated", untreatedreps) , rep("treated", treatedreps)), levels=c("untreated","treated"))
    deseq.counts.table = DESeqDataSetFromMatrix(mat, DataFrame(sample.conditions), ~ sample.conditions);
    colData(deseq.counts.table)$condition<-factor(colData(deseq.counts.table)$sample.conditions, levels=c("untreated","treated"));
    dds = DESeq(deseq.counts.table);
    res = results(dds);
    res = res[order(res$padj),];
    return(res)
}


plot.ma.lattice.publishable <- function(ma.df, filename = 'file.name', adj = 0.001,
                                        title.main = "Differential Expression") {
    ma.df$group <- 'a'
    ma.df[!is.na(ma.df$padj) & !is.na(ma.df$log2FoldChange) &
          ma.df$padj < adj & ma.df$log2FoldChange < 0,]$group <- 'b'
    ma.df[!is.na(ma.df$padj) & !is.na(ma.df$log2FoldChange) &
          ma.df$padj < adj & ma.df$log2FoldChange > 0,]$group <- 'c'
    ma.df[ma.df$padj > 0.5 & !is.na(ma.df$padj) & !is.na(ma.df$log2FoldChange) &
          abs(ma.df$log2FoldChange) < 0.25,]$group <- 'd'
    pdf(paste("MA_plot_", filename, "_FDR_pval_", as.character(adj),".pdf", sep=''),
        width=3.83, height=3.83, useDingbats=FALSE)
    print(xyplot(ma.df$log2FoldChange ~ log(ma.df$baseMean, base=10), groups=ma.df$group,
                 col=c("grey85","#2290cf","#ce228e", "grey65"),
                 main=title.main, scales="free", aspect=1, pch=20, cex=0.3,
                 ylab=expression("log"[2]~"Differential ATAC"),
                 xlab=expression("log"[10]~"Mean of Normalized Counts"),
                 par.settings=list(par.xlab.text=list(cex=1.1,font=2),
                                   par.ylab.text=list(cex=1.1,font=2))))
    dev.off()
    return(ma.df)
}

process.deseq.df <- function(ma.df, filename = 'file.name', adj = 0.001, pval = NA) {
    decreased =  ma.df[ma.df$padj < adj & !is.na(ma.df$padj) & ma.df$log2FoldChange < 0,];
    increased =  ma.df[ma.df$padj < adj & !is.na(ma.df$padj) & ma.df$log2FoldChange > 0,];
    not.different =  ma.df[ma.df$pvalue > 0.2 & !is.na(ma.df$pvalue) & !is.na(ma.df$log2FoldChange) & abs(ma.df$log2FoldChange) < 0.2,];

    coor = rownames(decreased)
    coor.start = sapply(strsplit(sapply(strsplit(as.character(coor),':'), "[", 2), "-"), "[", 1);
    coor.end = sapply(strsplit(as.character(coor),'-'), "[", 2)
    coor.chr = sapply(strsplit(as.character(coor),':'), "[", 1)
    df.coor = cbind(coor.chr, coor.start, coor.end, as.character(decreased$baseMean), as.character(decreased$log2FoldChange), '+', as.character(decreased$lfcSE), as.character(decreased$pvalue), as.character(decreased$padj))
    write.table(df.coor, file = paste('decreased_', filename, '_', adj, '_FDR.bed', sep =''), sep = '\t', quote=F, row.names=F, col.names=F)
    
    coor = rownames(increased)
    coor.start = sapply(strsplit(sapply(strsplit(as.character(coor),':'), "[", 2), "-"), "[", 1);
    coor.end = sapply(strsplit(as.character(coor),'-'), "[", 2)
    coor.chr = sapply(strsplit(as.character(coor),':'), "[", 1)
    df.coor = cbind(coor.chr, coor.start, coor.end, as.character(increased$baseMean), as.character(increased$log2FoldChange), '+', as.character(increased$lfcSE), as.character(increased$pvalue), as.character(increased$padj))
    write.table(df.coor, file = paste('increased_', filename, '_', adj, '_FDR.bed', sep =''), sep = '\t', quote=F, row.names=F, col.names=F)

    coor = rownames(not.different)
    coor.start = sapply(strsplit(sapply(strsplit(as.character(coor),':'), "[", 2), "-"), "[", 1);
    coor.end = sapply(strsplit(as.character(coor),'-'), "[", 2)
    coor.chr = sapply(strsplit(as.character(coor),':'), "[", 1)
    df.coor = cbind(coor.chr, coor.start, coor.end, as.character(not.different$baseMean), as.character(not.different$log2FoldChange), '+', as.character(not.different$lfcSE), as.character(not.different$pvalue), as.character(not.different$padj))
    write.table(df.coor, file = paste('notDifferent_', filename, '_', adj, '_FDR.bed', sep =''), sep = '\t', quote=F, row.names=F, col.names=F)
}

#running processes
# define variables:
directory = "/full/path/to/the/bigwig/files"
MAC.file = read.table("macrophage_summit_window_merged.bed") 

#only count on well annotated chromosomes:
chr.keep = paste0("chr",c(1:19))
MAC.file = MAC.file[MAC.file[,1] %in% c(chr.keep, 'chrX', 'chrY'),]

# count reads afalling into your peaks
df.MAC = get.raw.counts.interval(MAC.file, directory, file.prefix = 'MAC')

# 
head(df.MAC)
save(df.MAC, file = "df.MAC.Rdata")

10 Exploration

10.1 format

# load library:
library(DESeq2)
library(DEGreport)
library(tibble)
library(lattice)
library(tidyr)
source('https://raw.githubusercontent.com/mjg54/znf143_pro_seq_analysis/master/docs/ZNF143_functions.R')
#load the count table that contains gene counts across all conditions.
load("/directory/to/df.MAC.Rdata")
sample.conditions = c(sapply(strsplit(as.character(colnames(df.MAC)), '_'), '[', 2)[1:12], rep("t0",4))
sample.conditions = factor(sample.conditions, levels=c("t0","15min","30min","45min"))

deseq.counts.table = DESeqDataSetFromMatrix(df.MAC, as.data.frame(sample.conditions), ~ sample.conditions)
dds = DESeq(deseq.counts.table)

normalized.counts = counts(dds, normalized=TRUE)
save(normalized.counts,file='normalized.counts.atac.Rdata')

10.2 PCA

# generate PCA plot
rld = rlog(dds, blind=TRUE)
x = plotPCA(rld, intgroup="sample.conditions", returnData=TRUE)
plotPCAlattice(x, file = 'PCA_Macrophage_lattice_guertin.pdf')

PCA_Macrophage

10.3 LRT

#lrt
dds.lrt = DESeq(dds, test="LRT", reduced = ~ 1)
res.lrt = results(dds.lrt)

10.3.1 decide the p adjusted value (FDR) cut-off

library(dplyr)
library(tidyr)
#decide a padj cutoff
FDR<-as.data.frame(res.lrt$padj)
FDR$DAregions<-rownames(res.lrt)

pads<-seq(0,210, 0.05)
FDR$log10padj<-log10(FDR$`res.lrt$padj`)
FDR$log10padjround<-round(log10(FDR$`res.lrt$padj`), digits=2)

FDR.modi<-na.omit(FDR) %>%count(log10padjround)

for (i in 2:nrow(FDR.modi)){
  FDR.modi$accum_n[1]=FDR.modi$n[1]
  FDR.modi$accum_n[i]=FDR.modi$n[i]+FDR.modi$accum_n[(i-1)]
}
plot(FDR.modi$log10padjround, FDR.modi$accum_n, xlim=c(-12,0), xaxt='n',
   xlab="log10padjround", ylab="accumulative number of DA regions", pch=19)
axis(side=1, at=seq(-100, 0, by=0.5))
abline(v=-1.3467, col="red", lty=2, lwd=3)

#the closer to 0, the higher the padj

P_adj_cutoff0.045

10.3.2 use the cut-off to extract data that we are interested of

padj.cutoff = 0.045 

siglrt.re = res.lrt[res.lrt$padj < padj.cutoff & !is.na(res.lrt$padj),]

dim(siglrt.re)                                    

rld_mat <- assay(rld)
cluster_rlog = rld_mat[rownames(siglrt.re),]
meta = as.data.frame(sample.conditions)
rownames(meta) = colnames(cluster_rlog)

save.image('220609_macrophage.Rdata') # year==22, month==06, day ==09
save(cluster_rlog, meta, sample.conditions, file = 'cluster_rlog_pval_0.045.Rdata')

10.3.3 ATAC clustering

running on Xanadu:

module load gcc R
library(DESeq2)
library(DEGreport)
library(tibble)
library(lattice)
load('cluster_rlog_pval_0.045.Rdata')
     
clusters.all.test.0.045 <- degPatterns(cluster_rlog, metadata = meta, minc = 100, time = "sample.conditions", col=NULL, eachStep = TRUE)
save(clusters.all.test.0.045, file = '220609_clusters.all.minc100.pval0.045.Rdata')

Note that degPatterns doesn’t calculate significant difference between groups, so the matrix used as input should be already filtered to contain only genes that are significantly different or the most interesting genes to study.

minc: integer minimum number of genes in a group that will be return.

#load data
load('220609_clusters.all.minc100.pval0.045.Rdata') 
#using plotiing_clusters_LRT.R to optimize plotting functions with a small data frame

plot.df = clusters.all.test.0.045$normalized 

plot.df$sample.conditions = as.character(plot.df$sample.conditions)
plot.df$sample.conditions[plot.df$sample.conditions == 't0'] = 0
plot.df$sample.conditions[plot.df$sample.conditions == '15min'] = 15
plot.df$sample.conditions[plot.df$sample.conditions == '30min'] = 30
plot.df$sample.conditions[plot.df$sample.conditions == '45min'] = 45

plot.df$sample.conditions = as.numeric(plot.df$sample.conditions)
plot.df = plot.df[order(plot.df$genes),]
plot.df = plot.df[order(plot.df$sample.conditions),]

plot.df$cluster = paste('cluster', as.character(plot.df$cluster), sep = '')

plot.df$chr = sapply(strsplit(plot.df$genes, '[.]'), '[', 1)
plot.df$start = sapply(strsplit(plot.df$genes, '[.]'), '[', 2)
plot.df$end = sapply(strsplit(plot.df$genes, '[.]'), '[', 3)

10.3.4 dynamic peaks

save(plot.df,file='plot.df.Rdata')

write.table(cbind(plot.df$chr, plot.df$start, plot.df$end), 
            file = 'dynamic_peaks.bed', quote = FALSE, 
            sep = '\t', col.names=FALSE, row.names=FALSE)

10.3.5 nondynamic peaks

#Generate nondynamic peaks set for FIMO.
not.different = rownames(res.lrt[res.lrt$padj > 0.5 & !is.na(res.lrt$padj) & !is.na(res.lrt$log2FoldChange) & abs(res.lrt$log2FoldChange) < 0.25,])
#not.different = rownames(res.lrt[res.lrt$padj > 0.1 & !is.na(res.lrt$padj) & !is.na(res.lrt$baseMean) & res.lrt$baseMean > 10,])
chr = sapply(strsplit(not.different, ':'), '[', 1)
x = sapply(strsplit(not.different, ':'), '[', 2)
start = sapply(strsplit(x, '-'), '[', 1)
end = sapply(strsplit(x, '-'), '[', 2)

curated.not.different = data.frame(chr,start,end)
write.table(curated.not.different,file='nondynamic_peaks.bed',sep='\t',col.names=F,row.names=F,quote=F)

10.3.6 all.other peaks

in xanadu:

module load gcc bedtools  R

#collect ATAC peaks that were not sorted into dynamic/nondynamic peaks
intersectBed -v -a macrophage_summit_window.bed   #Only report those entries in A that have _no overlaps_ with B.
        
                -b *peaks.bed > all_other_peaks.bed #first three column
all_other<-unique(read.table("all_other_peaks.bed"))
write.table(all_other[,1:3],file='allother_peaks.bed',sep='\t',col.names=F,row.names=F,quote=F)

10.3.7 plot

10.3.7.1 (1)plot individual clusters

#just plot loess

loess.df = data.frame(matrix(ncol = 3, nrow = 0),  stringsAsFactors = FALSE)
colnames(loess.df) = c('ATAC','time','cluster')
for (i in unique(plot.df$cluster)){
    ATAC = loess.smooth(plot.df[plot.df$cluster == i,]$sample.conditions, plot.df[plot.df$cluster == i,]$value,  span = 1/2, degree = 1, family = c("gaussian"))$y
    time = loess.smooth(plot.df[plot.df$cluster == i,]$sample.conditions, plot.df[plot.df$cluster == i,]$value, span = 1/2, degree = 1, family = c("gaussian"))$x
    new = as.data.frame(cbind(ATAC, time, i),  stringsAsFactors = FALSE)
    colnames(new) = c('ATAC','time','cluster')
    new$ATAC = as.numeric(new$ATAC)
    new$time = as.numeric(new$time)
    loess.df = rbind(loess.df, new)
}
colors.clusters = rep('white', 7)
colors.clusters[c(4,6)] = 'blue' # transient decrease
colors.clusters[1] = 'cornflowerblue' #decrease-increase-decrease

colors.clusters[c(3,7)] = 'red' #transient increase
colors.clusters[5] = 'tomato3'  #gradual increase
colors.clusters[2] = 'lightsalmon' #increase-decrease-increase
pdf(paste('Clusters_individual_loess','.pdf', sep=''), width=4, height=4)

trellis.par.set(box.umbrella = list(lty = 1, col="black", lwd=1),
                box.rectangle = list( lwd=1.0, col="black", alpha = 1.0),
                plot.symbol = list(col="black", lwd=1.0, pch ='.'))
print(
xyplot(ATAC ~  time, groups = cluster, data = loess.df, type = c('l'),#type = c('l','p'),
       scales=list(x=list(cex=1.0,relation = "free", rot = 45), y =list(cex=1.0, relation="free")),
       auto.key = T,
       aspect=1.0,
       between=list(y=0.5, x=0.5),
       ylab = list(label = 'cluster loess of ATAC signal', cex =1.0),
       xlab = list(label = 'Time (minutes)', cex =1.0),
       par.settings = list(superpose.symbol = list(pch = c(16),
                                                   col=colors.clusters, cex =0.5),
                           strip.background=list(col="grey80"),
                           superpose.line = list(col = colors.clusters, lwd=c(3),
                                                 lty = c(1))))

      )
key=list(space="right",
         lines=list(col=c("blue", "cornflowerblue", "red", "tomato3", "lightsalmon")), 
         lty=c(3,2), 
         lwd=6,
         text=list(c("cluster1", "cluster4,6", "cluster3,7", "cluster5", "cluster2"))
)


dev.off()

dynamic_cluster

10.3.7.2 (2)plot all traces

#this has a bwplot, loess curve, points, and lines, which can be removed from the panel function to customize
# Local regression or local polynomial regression, also known as moving regression, is a generalization of the moving average and polynomial regression. Its most common methods, initially developed for scatterplot smoothing, are LOESS and LOWESS, both pronounced. 
pdf(paste('Clusters_minc100_stringent_pvalue','.pdf', sep=''), width=11, height=15) 

trellis.par.set(box.umbrella = list(lty = 1, col="black", lwd=1),
                box.rectangle = list( lwd=1.0, col="black", alpha = 1.0),
                plot.symbol = list(col="black", lwd=1.0, pch ='.'))
print(
xyplot(value ~  sample.conditions | cluster, group = genes, data = plot.df, type = c('l'),#type = c('l','p'),
       scales=list(x=list(cex=1.0,relation = "free", rot = 45), y =list(cex=1.0, relation="free")),
       aspect=1.0,
       between=list(y=0.5, x=0.5),
       ylab = list(label = 'Normalized ATAC signal', cex =1.0),
       xlab = list(label = 'Time (minutes)', cex =1.0),
       par.settings = list(superpose.symbol = list(pch = c(16),
                                                   col=c('grey20'), cex =0.5),
                           strip.background=list(col="grey80"),
                           superpose.line = list(col = c('#99999980'), lwd=c(1),
                                                 lty = c(1))),
       panel = function(x, y, ...) {
           panel.xyplot(x, y, ...)
           panel.bwplot(x, y, pch = '|', horizontal = FALSE, box.width = 15, do.out = FALSE)
           panel.loess(x, y, ..., col = "blue", lwd =2.0,  span = 1/2, degree = 1, family = c("gaussian"))
           
})

      )
dev.off()

dynamic_cluster

10.3.7.3 (3)plot dendrogram

generate the dendrogram to confirm if we could merge some clusters

library(data.table)
x = as.data.table(plot.df)
plot.df.cluster = dcast(x, genes + cluster ~ sample.conditions, value.var="value")

avg.clusters = as.data.frame(matrix(nrow = 0, ncol = 4))
for (i in unique(plot.df.cluster$cluster)) {
    z = data.frame(matrix(colMeans(plot.df.cluster[plot.df.cluster$cluster == i,3:6]), 
                            ncol = 4, nrow = 1))
    rownames(z) = c(i)
    colnames(z) = as.character(colnames(plot.df.cluster)[3:6])
    avg.clusters = rbind(avg.clusters, z)
}


dd = dist(avg.clusters)
hc = hclust(dd, method = "complete")

pdf('dendrogram.pdf', width=8, height=5)
plot(hc, xlab = "Clusters", main = ' ', hang = -1)
abline(h = 1.5, lty = 2)
dev.off()

dynamic_cluster

10.3.7.4 (4)plot superclusters

cluster3 & 7

pdf(paste('Clusters_3_7','.pdf', sep=''), width=3.5, height=3.5, useDingbats =FALSE)

trellis.par.set(box.umbrella = list(lty = 1, col="black", lwd=1),
                box.rectangle = list( lwd=1.0, col="black", alpha = 1.0),
                plot.symbol = list(col="black", lwd=1.0, pch ='.'))
print(
    xyplot(value ~  sample.conditions , group = genes, data =
                       plot.df[plot.df$cluster == 'cluster3' | 
    plot.df$cluster == 'cluster7',]
         , type = c('l'),#type = c('l','p'),
       scales=list(x=list(cex=1.0,relation = "free", rot = 45), y =list(cex=1.0, relation="free")),
       aspect=1.0,
       between=list(y=0.5, x=0.5),
       ylab = list(label = 'Normalized ATAC signal', cex =1.0),
       xlab = list(label = 'Time (minutes)', cex =1.0),
       par.settings = list(superpose.symbol = list(pch = c(16),
                                                   col=c('grey20'), cex =0.5),
                           strip.background=list(col="grey80"),
                           superpose.line = list(col = c('#99999980'), lwd=c(1),
                                                 lty = c(1))),
       panel = function(x, y, ...) {
           panel.xyplot(x, y, ...)
           panel.bwplot(x, y, pch = '|', horizontal = FALSE, box.width = 15, do.out = FALSE)
           panel.loess(x, y, ..., col = "blue", lwd =2.0,  span = 1/2, degree = 1, family = c("gaussian"))
           
})

      )
dev.off()

dynamic_cluster

cluster 4 & 6

pdf(paste('Clusters_4_6','.pdf', sep=''), width=3.5, height=3.5, useDingbats =FALSE)

trellis.par.set(box.umbrella = list(lty = 1, col="black", lwd=1),
                box.rectangle = list( lwd=1.0, col="black", alpha = 1.0),
                plot.symbol = list(col="black", lwd=1.0, pch ='.'))
print(
    xyplot(value ~  sample.conditions , group = genes, data =
                       plot.df[plot.df$cluster == 'cluster4' | 
    plot.df$cluster == 'cluster6',]
         , type = c('l'),#type = c('l','p'),
       scales=list(x=list(cex=1.0,relation = "free", rot = 45), y =list(cex=1.0, relation="free")),
       aspect=1.0,
       between=list(y=0.5, x=0.5),
       ylab = list(label = 'Normalized ATAC signal', cex =1.0),
       xlab = list(label = 'Time (minutes)', cex =1.0),
       par.settings = list(superpose.symbol = list(pch = c(16),
                                                   col=c('grey20'), cex =0.5),
                           strip.background=list(col="grey80"),
                           superpose.line = list(col = c('#99999980'), lwd=c(1),
                                                 lty = c(1))),
       panel = function(x, y, ...) {
           panel.xyplot(x, y, ...)
           panel.bwplot(x, y, pch = '|', horizontal = FALSE, box.width = 15, do.out = FALSE)
           panel.loess(x, y, ..., col = "blue", lwd =2.0,  span = 1/2, degree = 1, family = c("gaussian"))
           
})

      )
dev.off()

dynamic_cluster #### (5)plot all superclusters

# increase dynamics:
gradual.up = plot.df[plot.df$cluster == 'cluster5',]

up.down = plot.df[plot.df$cluster == 'cluster3' |
                    plot.df$cluster == 'cluster7' ,]

up.down.up = plot.df[plot.df$cluster == 'cluster2',]

#decrease dynamics:

down.up = plot.df[plot.df$cluster == 'cluster4'|
                    plot.df$cluster == 'cluster6',]

down.up.down = plot.df[plot.df$cluster == 'cluster1',]


gradual.up$supercluster = 'gradual.up'
up.down$supercluster = 'up.down'
up.down.up$supercluster = 'up.down.up'
down.up$supercluster = 'down.up'
down.up.down$supercluster = 'down.up.down'


plot.df.atac = rbind(gradual.up,
      up.down,
      up.down.up,
      down.up,
      down.up.down)


plot.df.atac = plot.df.atac[,2:16]
plot.df.atac$genes = paste0(plot.df.atac$chr,':',plot.df.atac$start,'-',plot.df.atac$end)
save(plot.df.atac,file='plot.df.atac.Rdata')
plot.df.atac$genes = as.factor(plot.df.atac$genes)
cat.colours = plot.df.atac[plot.df.atac$merge == 'one_groupt0',]
cat.colours$genes <- as.factor(cat.colours$genes)
cat.colours$supercluster <- as.factor(cat.colours$supercluster)

cat.colours$colour[cat.colours$supercluster == 'up.down.up'] <- '#FF000008'
cat.colours$colour[cat.colours$supercluster == 'up.down'] <- '#FF000008'
cat.colours$colour[cat.colours$supercluster == 'gradual.up'] <- '#FF000008'
cat.colours$colour[cat.colours$supercluster == 'down.up.down'] <- '#0000FF08'
cat.colours$colour[cat.colours$supercluster == 'down.up'] <- '#0000FF08'

cat.colours$colour <- as.factor(cat.colours$colour)

cat.colours <- cat.colours[match(levels(plot.df.atac$genes), cat.colours$genes), ]
pdf('atac_superclusters.pdf', width=6.83, height=5)

trellis.par.set(box.umbrella = list(lty = 1, col="black", lwd=1),
                box.rectangle = list( lwd=1.0, col="black", alpha = 1.0),
                plot.symbol = list(col="black", lwd=1.0, pch ='.'))
print(
xyplot(value ~  sample.conditions | supercluster, group = genes,
       data = plot.df.atac, type = c('l'),#type = c('l','p'),
       scales=list(x=list(cex=1.0,relation = "free", rot = 45),
       y =list(cex=1.0, relation="free")),
       aspect=1.0,
       between=list(y=0.5, x=0.5),
       layout = c(5,1),
       ylab = list(label = 'Normalized ATAC signal', cex =1.0),
       xlab = list(label = 'Time (minutes)', cex =1.0),
       par.settings = list(superpose.symbol = list(pch = c(16),
                                                   col=c('grey20'), cex =0.5),
                           strip.background=list(col="grey80"),
                           superpose.line = list(col = as.character(cat.colours$colour), lwd=c(1),
                                                 lty = c(1))),
       panel = function(x, y, ...) {
           panel.xyplot(x, y, ...)
           panel.bwplot(x, y, pch = '|', horizontal = FALSE, box.width = 15,
                           do.out = FALSE)
           panel.spline(x, y, col = 'black', lwd =2.0, ...) 
           
})

      )
dev.off()

dynamic_cluster down.up: cluster 4 & 6 down.up.down: cluster 1 gradual.up: cluster 5 up.down: cluster 3 & 7 up.down.up: cluster 2

11 Motif analysis

11.1 MEME motif enrichment analysis

for (i in unique(plot.df$cluster)) {
    print(i)
    write.table(plot.df[plot.df$cluster == i,
                        c('chr','start','end', 'value', 'cluster')][!duplicated(plot.df[plot.df$cluster == i,]$genes),],
                file = paste0('cluster_bed_',
                              gsub(" ", "", i, fixed = TRUE),'.bed'),
                quote = FALSE, row.names = FALSE, col.names = FALSE, sep = '\t')
}

11.1.1 run MEME on each cluster.bed files to find enriched motifs

running on Xanadu:

wget https://hgdownload.soe.ucsc.edu/goldenPath/mm39/bigZips/mm39.chrom.sizes
mkdir meme_motif_enrichment

for i in *cluster*bed
do
    name=$(echo $i | awk -F"_" '{print $NF}' | awk -F".bed" '{print $1}')
    echo $name
    echo '#SBATCH -o' $name'.meme.out' > temp.txt
    echo 'i='$i > temp2.txt
    cat meme_slurm_header_1.txt temp.txt \
        meme_slurm_header_2.txt temp2.txt \
        meme_slurm_header_3.txt > $name.meme.sh
    sbatch $name.meme.sh                                                                                                                                                           
    rm temp.txt
    rm temp2.txt
done

(1)meme_slurm_header_1.txt

#!/bin/bash

#SBATCH --job-name=meme1        # name for job
#SBATCH -N 1                    # number of nodes
#SBATCH -n 1                    # number of jobs / tasks
#SBATCH -c 48                    # number of cores
#SBATCH --mem=128G                # RAM (memory) requested

(2)meme_slurm_header_2.txt

#SBATCH -p general           # SLURM partition
#SBATCH --qos=general        # SLURM Quality of service
#SBATCH --mail-type=ALL
#SBATCH --mail-user=ssun@uchc.edu
#SBATCH -o scriptname.sh_%j.out
#SBATCH -e scriptname.sh_%j.err

hostname # have this at beginning of every script for  troubleshooting

module purge

module load gcc bedtools/2.26.0 meme/5.4.1

(3)meme_slurm_header_3.txt

cd meme_motif_enrichment
name=$(echo $i | awk -F"_" '{print $NF}' | awk -F".bed" '{print $1}')
echo $name

slopBed -i ../$i \
        -g ../mm39.chrom.sizes -b -50 | \
        fastaFromBed -fi ../mm39.fa \
                     -bed stdin -fo ${name}.fasta
head ${name}.fasta
 meme -p 59 -oc ${name}.meme_output -nmotifs 15 \
          -objfun classic -evt 0.01 -searchsize 0 -minw 6 \
          -maxw 18 -revcomp -dna -markov_order 3 -maxsize 100000000  \
          ${name}.fasta

cd ..

11.1.2 Prepare MEME databases

#Generate homer_uniprobe_jaspar PSWM database

wget http://meme-suite.org/meme-software/Databases/motifs/motif_databases.12.19.tgz
tar -xzf motif_databases.12.19.tgz
#JASPAR
mv motif_databases/JASPAR/JASPAR2018_CORE_vertebrates_non-redundant.meme $PWD
#Uniprobe
mv motif_databases/MOUSE/uniprobe_mouse.meme $PWD

#Homer
#CAUTION: the HOMER_MEME_conversion.py was written for python2 so remember to specify python2.7 if you have python3 installed.
wget https://raw.githubusercontent.com/mjg54/znf143_pro_seq_analysis/master/docs/HOMER_MEME_conversion.py
wget http://homer.ucsd.edu/homer/custom.motifs
python2.7 HOMER_MEME_conversion.py -i custom.motifs -o homer.motifs

#edit databases to work with tomtom
cp JASPAR2018_CORE_vertebrates_non-redundant.meme JASPAR_edited_meme.txt
grep MOTIF JASPAR_edited_meme.txt > motifs.txt
cat motifs.txt | while read motif
do
    name=$(echo $motif | awk -F" " '{print $NF}')
    temp=$(echo 'MOTIF' $name'_jaspar')
    echo $temp
    sed -i "s;${motif};${temp};g" JASPAR_edited_meme.txt
done
rm motifs.txt

#edit databases to work with tomtom
cp uniprobe_mouse.meme uniprobe_edited_meme.txt
grep MOTIF uniprobe_edited_meme.txt > motifs.txt
cat motifs.txt | while read motif
do
    name=$(echo $motif | awk -F" " '{print $NF}')
    temp=$(echo 'MOTIF' $name'_uniprobe')
    echo $temp
    sed -i "s;${motif};${temp};g" uniprobe_edited_meme.txt
done
#remove 'secondary' motifs
sed -i -e '4210,8365d;' uniprobe_edited_meme.txt

rm motifs.txt

#edit databases to work with tomtom
cp homer.motifs_meme.txt homer_edited_meme.txt
grep MOTIF homer_edited_meme.txt > motifs.txt
cat motifs.txt | while read motif
do
    name=$(echo $motif | awk -F" " '{print $NF}')
    name=$(echo $name | awk -F"/" '{print $1}')
    temp=$(echo 'MOTIF' $name'_homer')
    echo $temp
    sed -i "s;${motif};${temp};g" homer_edited_meme.txt
done
rm motifs.txt

#Collect all database motifs into one file
cat homer_edited_meme.txt \
    uniprobe_edited_meme.txt JASPAR_edited_meme.txt > homer_uniprobe_jaspar_edited.txt

11.1.3 Query MEME motifs in TOMTOM

grouping_motif_families1.sh

#!/bin/bash

#SBATCH --job-name=grouping_motif     # name for job
#SBATCH -N 1                    # number of nodes 
#SBATCH -n 1                    # number of jobs / tasks 
#SBATCH -c 4                    # number of cores 
#SBATCH -p general           # SLURM partition 
#SBATCH --qos=general        # SLURM Quality of service 
#SBATCH --mem=32G                # RAM (memory) requested 
#SBATCH --mail-type=ALL 
#SBATCH --mail-user=ssun@uchc.edu
#SBATCH -o scriptname.sh_%j.out
#SBATCH -e scriptname.sh_%j.err

module purge
module load gcc meme/5.4.1

for i in *output/meme.txt
do
    name=$(echo $i | awk -F"/me" '{print $1}' | awk -F".meme" '{print $1}')
    echo $name
    tomtom -no-ssc -o $name.tomtom_output -verbosity 1 -min-overlap 5 -dist ed -evalue -thresh 0.05 $name.meme_output/meme.txt homer_uniprobe_jaspar_edited.txt
    tomtom -no-ssc -o $name.tomtom_output -verbosity 1 -min-overlap 5 -dist ed -text -evalue -thresh 0.05 $name.meme_output/meme.txt homer_uniprobe_jaspar_edited.txt > $name.tomtom_output/tomtom.txt
done


mkdir motifid_clusters
cd motifid_clusters

echo 'Extracting Motif Names'
for file in ../*tomtom_output/*.txt
do
    name=$(echo $file | awk -F"/" '{print $(NF-1)}' | awk -F".tomtom_output" '{print $1}')
    echo $name
    motifid=$name
    #echo $file
    linenum=$(awk 'END {print NR}' $file)
    #echo $linenum
    first=2
    i=$first
    if [[ $linenum != 5 ]]
    then
    mkdir $name
    while [[  $i -le $linenum ]]
    do
        head -$i $file | tail -1 > lastline
        mapid[$i]=$(awk 'END {print $2}' lastline | awk -F"(" '{print $1}')
        #echo mapid[$i]_${mapid[$i]}
        echo ${mapid[$i]} >> $name/motifidlist_$motifid.txt
        ((i = i + 1))
    done
    head -n $(( $(wc -l $name/motifidlist_$motifid.txt | awk '{print $1}') - 4 )) $name/motifidlist_$motifid.txt > $name/motifidlist_$motifid.final.txt
    rm $name/motifidlist_$motifid.txt
    mv $name/motifidlist_$motifid.final.txt $name/motifidlist_$motifid.txt
    fi
done
rm lastline
cat */motifidlist*txt > all_matched_motifs.txt
mv all_matched_motifs.txt ..
cd ..

11.2 defining motif families

11.2.1 Query factors

mkdir individual_memes
#pulls each de novo motif from all the clusters and saves each as its own file

#extract individual meme files from combined database
#CAUTION: you only need to run this once, even if you're working through the code again
cd individual_memes
wget https://raw.githubusercontent.com/guertinlab/adipogenesis/master/motif_analysis/MEME_individual_from_db.py
python2.7 MEME_individual_from_db.py -i ../homer_uniprobe_jaspar_edited.txt


for file in *meme.txt 
do
    name=$(echo $file | awk -F"homer_uniprobe_jaspar_edited.txt_" '{print $1}')
    mv $file ${name}meme.txt
    
done

cd ..
sort all_matched_motifs.txt | uniq -u > all_matched_motifs_uniq.txt



#tomtom all query factors against all others
mkdir tomtom_all_query_factors
cd tomtom_all_query_factors

cat ../all_matched_motifs_uniq.txt | while read factor
do
    echo $factor
    cp ../individual_memes/${factor}_meme.txt $PWD
done

cat *meme.txt > ../all_query_factors_meme.txt

module load gcc meme/5.4.1

for meme in *meme.txt
do
    name=$(echo $meme | awk -F".txt_" '{print $NF}' | awk -F"_meme.txt" '{print $1}')
    #echo $name
    tomtom -no-ssc -o $name.tomtom_output -verbosity 1  -incomplete-scores -min-overlap 1 -dist ed -evalue -thresh 0.0005 $meme ../all_query_factors_meme.txt
    cd $name.tomtom_output
    cut -f1,2,5 tomtom.tsv | tail -n +2 | sed '$d' | sed '$d' | sed '$d' | sed '$d' >> ../3_col_combined_motif_db_pre.txt
    cd ..
done





grep -v '#' 3_col_combined_motif_db_pre.txt > 3_col_combined_motif_db.txt
rm 3_col_combined_motif_db_pre.txt
cp 3_col_combined_motif_db.txt ..

11.2.2 Visualize PSWM

A position weight matrix (PWM), also known as a position-specific weight matrix (PSWM) or position-specific scoring matrix (PSSM), is a commonly used representation of motifs (patterns) in biological sequences.

PWMs are often derived from a set of aligned sequences that are thought to be functionally related and have become an important part of many software tools for computational motif discovery.

run this: https://raw.githubusercontent.com/guertinlab/adipogenesis/master/Pipeline_ATAC/4.2_visualize_pswms.R

#install.packages("igraph")
#install.packages("dichromat")
library(igraph)
library(dichromat)

#setwd('/your/directory/to/3_col_combined_motif_db.txt')

threecol=read.table("3_col_combined_motif_db.txt",header=F,stringsAsFactors = F,sep='\t')
colnames(threecol)=c('from','to','e_value')
threecol$weight=abs(log(threecol$e_value))

#create the graph variable
g=graph.data.frame(threecol,directed=F)
g=simplify(g)

cluster=clusters(g)

for(i in 1:length(groups(cluster))) {
write.table(groups(cluster)[i],file=paste0('PSWM_family_',i,'.txt'), 
    col.names = F, row.names = F, quote = F, sep = '\t')
}

l=layout.fruchterman.reingold(g)
l=layout.norm(l,-1,1,-1,1)

colfunc<-colorRampPalette(c("red","yellow","springgreen","royalblue","purple")) # 9 families
#pick a distinct color 
mycol = colfunc(length(groups(cluster)))

pdf(file='families.pdf',width=10,height=10)
plot(g,layout=l,rescale=F,vertex.label.cex=.5,xlim=range(l[,1]),  ylim=range(l[,2]),
edge.width=E(g)$weight/20,vertex.size=degree(g,mode='out')/5,
edge.curved=T,vertex.label=NA,vertex.color=mycol[cluster$membership],
margin=0,asp=0)
dev.off()

11.3 Generating composite motifs

11.3.1 Download required scripts

#!/bin/bash

wget https://raw.githubusercontent.com/guertinlab/adipogenesis/master/Pipeline_ATAC/misc_scripts/tomtom_output_to_composite.py
wget https://github.com/guertinlab/atac_rotation/blob/main/generate_composite_motif.R

wget https://raw.githubusercontent.com/guertinlab/adipogenesis/master/Pipeline_ATAC/misc_scripts/meme_header.txt

11.3.2 Query TOMTOM to calculate index and values for PSWM

#!/bin/bash

mkdir composite_motifs
cd composite_motifs

#query tomtom for each factor against all others
module purge
module load gcc meme/5.4.1


for txt in ../PSWM_family_*.txt
do
    dir_name=$(echo $txt | awk -F'../' '{print $2}' | awk -F'.txt' '{print $1}')
    echo $dir_name
    mkdir $dir_name
    cd $dir_name
    
    cat ../$txt | while read line
    do
echo $line
cp ../../individual_memes/${line}_meme.txt $PWD
    done
    echo ''
    query_factor=`head -1 ../$txt`
    if [[ $(wc -l < ../$txt) -ge 2 ]]
    then    
cat ../$txt | { while read line
do
query_factor=$line
rm ref_factors_meme.txt
mv ${query_factor}_meme.txt ..
cat *_meme.txt > ref_factors_meme.txt
mv ../${query_factor}_meme.txt $PWD
tomtom -no-ssc -o ${query_factor}.tomtom_output -verbosity 1 -incomplete-scores -min-overlap 1 -dist ed -evalue -thresh 0.0005 ${query_factor}_meme.txt ref_factors_meme.txt                
if [[ $(wc -l < ${query_factor}.tomtom_output/tomtom.tsv) -ge $max_motif ]]
then
max_motif=$(wc -l < ${query_factor}.tomtom_output/tomtom.tsv)
final_query=$query_factor
fi
done
echo FINAL_QUERY IS $final_query
wc -l ${final_query}.tomtom_output/tomtom.tsv
cd ${final_query}.tomtom_output
python2.7 /home/FCAM/ssun/motifanalysis/meme_motif_enrichment2/csite_minsite/tomtom_output_to_composite.py -i tomtom.xml
mv tomtom.xml_test_index_pswm.txt ../composite.values.txt
mv tomtom.xml_test_index_rc_offset.txt ../composite.index.txt
cd ../..
}
fi
    cd ..
done

11.3.4 TOMTOM the composite to the database

run tomtom on composite meme output

#!/bin/bash

module purge
module load gcc meme/5.4.1

mkdir composite_tomtom
# tomtom to database

for family in composite_motifs/PSWM*
do
    name=$(echo $family | awk -F"composite_motifs/PSWM_" '{print $NF}')
    echo $name
    tomtom -no-ssc -o composite_tomtom/$name.tomtom_output -verbosity 1 -min-overlap 5 -dist ed -evalue -thresh 0.05 composite_motifs/PSWM_${name}/*${name}_meme.txt homer_uniprobe_jaspar_edited.txt
    tomtom -no-ssc -o composite_tomtom/$name.tomtom_output -verbosity 1 -min-overlap 5 -dist ed -text -evalue -thresh 0.05 composite_motifs/PSWM_${name}/*${name}_meme.txt homer_uniprobe_jaspar_edited.txt > composite_tomtom/$name.tomtom_output/tomtom.txt
done

11.3.5 Isolate the top database PSWM

(1)extract motif names from tomtom output

top_database_PSWM_motifs.txt

# in /composite_tomtom
# extract motif names from tomtom output
mkdir motifid_top_PSWM
cd motifid_top_PSWM

echo 'Extracting Motif Names'
for file in ../composite_tomtom/*family*tomtom_output/*.txt
do
    name=$(echo $file | awk -F"/" '{print $(NF-1)}' | awk -F".tomtom_output" '{print $1}')
    echo $name
    motifid=$name
    #echo $file
    linenum=$(awk 'END {print NR}' $file) #number of lines in the txt file
    #echo $linenum 
    first=2
    i=$first
    if [[ $linenum != 5 ]]
    then
    mkdir $name
      while [[  $i -le $linenum ]] #less than or equal to
      do
        head -$i $file | tail -1 > lastline
        mapid[$i]=$(awk 'END {print $2}' lastline | awk -F"(" '{print $1}')
        #echo mapid[$i]_${mapid[$i]}
        echo ${mapid[$i]} >> $name/motifidlist_$motifid.txt
        ((i = i + 1))
      done
    head -n $(( $(wc -l $name/motifidlist_$motifid.txt | awk '{print $1}') - 4 )) \
        $name/motifidlist_$motifid.txt > $name/motifidlist_$motifid.final.txt
    rm $name/motifidlist_$motifid.txt
    mv $name/motifidlist_$motifid.final.txt $name/motifidlist_$motifid.txt
    fi
done
rm lastline

cat *family/*.txt> top_database_PSWM_motifs.txt
cp top_database_PSWM_motifs.txt ..
sort top_database_PSWM_motifs.txt | uniq -u > top_database_PSWM_motifs_uniq.txt
cp top_database_PSWM_motifs_uniq.txt ..
cd ..
# in /composite_tomtom
# extract motif names from tomtom output

mkdir motifid_top_PSWM_top
cd motifid_top_PSWM_top

echo 'Extracting top hitted Motif Names in each family'
for file in ../composite_tomtom/*family*tomtom_output/*.txt
do
    name=$(echo $file | awk -F"/" '{print $(NF-1)}' | awk -F".tomtom_output" '{print $1}')
    echo $name
    motifid=$name
    #echo $file
    linenum=$(awk 'END {print NR}' $file) #number of lines in the txt file
    #echo $linenum 
    first=2
    i=$first
      if [[ $linenum != 5 ]]
      then
      
      sed '1d; /^$/d; /#/d' $file > ../composite_tomtom/$name.tomtom_output/tomtom_comp.txt
      fi
done

for file in ../composite_tomtom/*family*tomtom_output/tomtom_comp.txt
do
      name=$(echo $file | awk -F"/" '{print $(NF-1)}' | awk -F".tomtom_output" '{print $1}')
      echo $name
      awk 'NR == 1 || $4 < min {line = $0; min = $4}END{print line}' $file | awk '{print $2}' >> motifidlist_$name.txt
    
done


cat *.txt> top_database_PSWM_motifs1.txt
cp top_database_PSWM_motifs1.txt ..


sort top_database_PSWM_motifs1.txt | uniq -u > top_database_PSWM_motifs1_uniq.txt
cp top_database_PSWM_motifs_uniq1.txt ..
cd ..
  1. extract individual meme files from combined database
#!/bin/bash

#tomtom all query factors against all others
mkdir top_database_PSWM
cd top_database_PSWM


cat ../top_database_PSWM_motifs_uniq.txt | while read factor
do
    echo $factor
    cp ../individual_memes/${factor}_meme.txt $PWD
done

cat *meme.txt > ../all_query_factors_meme_top.txt
#!/bin/bash

#tomtom all query factors against all others
mkdir top_database_PSWM_top
cd top_database_PSWM_top


cat ../top_database_PSWM_motifs1_uniq.txt | while read factor
do
    echo $factor
    cp ../individual_memes/${factor}_meme.txt $PWD
done

cat *meme.txt > ../all_query_factors_meme_top_top.txt

11.3.6 Genomic FIMO of the top database PSWM

fimo_composites_top.sh

#!/bin/bash

#SBATCH --job-name=fimo_composites_top     # name for job
#SBATCH -N 1                    # number of nodes 
#SBATCH -n 1                    # number of jobs / tasks 
#SBATCH -c 4                    # number of cores 
#SBATCH -p general           # SLURM partition 
#SBATCH --qos=general        # SLURM Quality of service 
#SBATCH --mem=200G                # RAM (memory) requested 
#SBATCH --mail-type=ALL 
#SBATCH --mail-user=ssun@uchc.edu
#SBATCH -o scriptname.sh_%j.out
#SBATCH -e scriptname.sh_%j.err

module load gcc meme/5.4.1
# export temp dir
export TMPDIR="${HOME}/temp" 

mkdir fimo_top_database_PSWM
cd fimo_top_database_PSWM

for i in ../top_database_PSWM/*_meme.txt
do
    name=$(echo $i | awk -F"../top_database_PSWM/" '{print $NF}' | awk -F"_meme.txt" '{print $1}')
    echo $name
    
    #run FIMO
    
    fimo --thresh 0.001 --text $i \
        /home/FCAM/ssun/motifanalysis/mm39.fa \
        > ${name}_fimo.txt

    #this takes top 2M
    score=$(tail -n +2 ${name}_fimo.txt | sort -nrk6,6 | awk 'FNR == 2000000 {print $6}')
    echo $score
    tail -n +2 ${name}_fimo.txt | awk -v sco="$score" '{ if ($6 >= sco) { print } }' | \
        awk '{OFS="\t";} {print $2,$3,$4,$7,$6,$5,$8}' > ${name}_2M.txt

    #this was to get the order of conformity to consensus.
    tomtom -no-ssc -oc ${name}_ranks.tomtom_output -verbosity 1 -incomplete-scores -min-overlap 1 \
        -dist ed -evalue -thresh 0.05 ../all_query_factors_meme.txt ${name}_fimo.txt
done

fimo_top_new.sh

#!/bin/bash

#SBATCH --job-name=fimo_composites_top     # name for job
#SBATCH -N 1                    # number of nodes 
#SBATCH -n 1                    # number of jobs / tasks 
#SBATCH -c 4                    # number of cores 
#SBATCH -p general           # SLURM partition 
#SBATCH --qos=general        # SLURM Quality of service 
#SBATCH --mem=200G                # RAM (memory) requested 
#SBATCH --mail-type=ALL 
#SBATCH --mail-user=ssun@uchc.edu
#SBATCH -o scriptname.sh_%j.out
#SBATCH -e scriptname.sh_%j.err

module load gcc meme/5.4.1
# export temp dir
export TMPDIR="${HOME}/temp" 

mkdir fimo_top_database_PSWM_top
cd fimo_top_database_PSWM_top

for i in ../top_database_PSWM_top/*_meme.txt
do
    name=$(echo $i | awk -F"../top_database_PSWM_top/" '{print $NF}' | awk -F"_meme.txt" '{print $1}')
    echo $name
    
    #run FIMO
    
    fimo --thresh 0.001 --text $i \
        /home/FCAM/ssun/motifanalysis/mm39.fa \
        > ${name}_fimo.txt

    #this takes top 2M
    score=$(tail -n +2 ${name}_fimo.txt | sort -nrk6,6 | awk 'FNR == 2000000 {print $6}')
    echo $score
    tail -n +2 ${name}_fimo.txt | awk -v sco="$score" '{ if ($6 >= sco) { print } }' | \
        awk '{OFS="\t";} {print $2,$3,$4,$7,$6,$5,$8}' > ${name}_2M.txt

done

11.3.7 top motif hit seqlogos

seqlogo

12 Generating comprehensive ATAC dataframe

Non-dynamic peaks: padj > 0.5 & log2FoldChange < 0.25
N=49111
Dynamic peaks: padj <= 0.045
N=34056

#!/bin/bash
module load gcc bedtools
for i in *_2M.txt
do
    name=$(echo $i | awk -F"/" '{print $NF}' | awk -F"_2M.txt" '{print $1}')
    echo $name
    intersectBed -loj -a ../dynamic_peaks.bed -b $i > ${name}_fimo.bed
    intersectBed -loj -a ../nondynamic_peaks.bed -b $i > ${name}_fimo_nondyn.bed
    intersectBed -loj -a ../allother_peaks.bed -b $i > ${name}_fimo_allother.bed
    cat $i | cut -f1-3,5 | sort -k1,1 -k2,2n > ${name}_2M.bed 
done
#!/bin/bash
mkdir main_figure_beds
cp *_fimo.bed ../main_figure_beds
library(ggplot2)
library(lattice)
library(pheatmap)
library(fabricatr)
library(ggseqlogo)

load('plot.df.atac.Rdata')

#generate 'fimo.scores.atac.Rdata' object
res = unique(read.table('dynamic_peaks.bed'))
colnames(res) = c('chr', 'start', 'end')
res$start = as.numeric(as.character(res$start))
res$end = as.numeric(as.character(res$end))
rownames(res) = paste0(res[,1], ':', res[,2], '-', res[,3])

#main figure families

for(bed.file in Sys.glob(file.path('*_fimo.bed'))) {
    
    factor.name = strsplit(bed.file, "/")[[1]]
    factor.name = strsplit(factor.name[length(factor.name)],
                           '_fimo.bed')[[1]][1]
    print(factor.name)
    x = read.table(bed.file, stringsAsFactors=FALSE)
    x = x[x[,6] != -1,]   
    y = aggregate(as.numeric(V8)~V1+V2+V3, data=x, FUN=sum)
    colnames(y) = c('chr', 'start', 'end', factor.name)
    rownames(y) = paste0(y[,1], ':', y[,2], '-', y[,3])
    
    func <- function(peak) {
        if(peak %in% rownames(y)) {
            return(y[rownames(y) == peak,ncol(y)])
        } else {
            return(NA)
        }
    }
    
    res[,ncol(res)+1] = sapply(rownames(res),func)
    colnames(res)[ncol(res)] = factor.name

}

res = res[,-c(1:3)]

fimo.scores.atac = res
save(fimo.scores.atac, file = 'fimo.scores.atac.Rdata')

#save as bed files
for(i in 1:ncol(fimo.scores.atac)) {
    temp = fimo.scores.atac[!is.na(fimo.scores.atac[,i]),]
    chr = sapply(strsplit(rownames(temp), ':'), '[', 1)
    x = sapply(strsplit(rownames(temp), ':'), '[', 2)
    start = sapply(strsplit(x, '-'), '[', 1)
    end = sapply(strsplit(x, '-'), '[', 2)
    bed = data.frame(chr, start, end, score = temp[,i])
    write.table(bed, file = paste0(colnames(fimo.scores.atac)[i],'_instances.bed'),
        row.names=F,col.names=F,quote=F,sep='\t')
}

non-dynamic peaks as baseline

# generate fimo.scores.nondyn.atac.Rdata
res_c = unique(read.table('nondynamic_peaks.bed'))
colnames(res_c) = c('chr', 'start', 'end')
res_c$start = as.numeric(as.character(res_c$start))
res_c$end = as.numeric(as.character(res_c$end))
rownames(res_c) = paste0(res_c[,1], ':', res_c[,2], '-', res_c[,3])

#main figure families

for(bed.file in Sys.glob(file.path('nondyn.bed/*_fimo_nondyn.bed'))) {
    
    factor.name = strsplit(bed.file, "/")[[1]]
    factor.name = strsplit(factor.name[length(factor.name)],
                           '_fimo_nondyn.bed')[[1]][1]
    print(factor.name)
    x = read.table(bed.file, stringsAsFactors=FALSE)
    x = x[x[,6] != -1,]   
    y = aggregate(as.numeric(V8)~V1+V2+V3, data=x, FUN=sum)
    colnames(y) = c('chr', 'start', 'end', factor.name)
    rownames(y) = paste0(y[,1], ':', y[,2], '-', y[,3])
    
    func <- function(peak) {
        if(peak %in% rownames(y)) {
            return(y[rownames(y) == peak,ncol(y)])
        } else {
            return(NA)
        }
    }
    
    res_c[,ncol(res_c)+1] = sapply(rownames(res_c),func)
    colnames(res_c)[ncol(res_c)] = factor.name

}

res_c = res_c[,-c(1:3)]

fimo.scores.nondyn.atac = res_c
save(fimo.scores.nondyn.atac, file = 'fimo.scores.nondyn.atac.Rdata')

all other peaks

# generate fimo.scores.all.other.atac.Rdata
res_c = unique(read.table('allother_peaks.bed'))
colnames(res_c) = c('chr', 'start', 'end')
res_c$start = as.numeric(as.character(res_c$start))
res_c$end = as.numeric(as.character(res_c$end))
rownames(res_c) = paste0(res_c[,1], ':', res_c[,2], '-', res_c[,3])

#main figure families

for(bed.file in Sys.glob(file.path('all_other_bed2/*_fimo_allother.bed'))) {
    
    factor.name = strsplit(bed.file, "/")[[1]]
    factor.name = strsplit(factor.name[length(factor.name)],
                           '_fimo_allother.bed')[[1]][1]
    print(factor.name)
    x = read.table(bed.file, stringsAsFactors=FALSE)
    x = x[x[,6] != -1,]   
    y = aggregate(as.numeric(V8)~V1+V2+V3, data=x, FUN=sum)
    colnames(y) = c('chr', 'start', 'end', factor.name)
    rownames(y) = paste0(y[,1], ':', y[,2], '-', y[,3])
    
    func <- function(peak) {
        if(peak %in% rownames(y)) {
            return(y[rownames(y) == peak,ncol(y)])
        } else {
            return(NA)
        }
    }
    
    res_c[,ncol(res_c)+1] = sapply(rownames(res_c),func)
    colnames(res_c)[ncol(res_c)] = factor.name

}
res_c = res_c[,-c(1:3)]

fimo.scores.all.other.atac = res_c
save(fimo.scores.all.other.atac, file = 'fimo.scores.all.other.atac.Rdata')
plot.df = data.frame()
for(i in 1:ncol(fimo.scores.atac)) {
    temp = plot.df.atac[plot.df.atac$genes %in% rownames(fimo.scores.atac[!is.na(fimo.scores.atac[,i]),]),]
    temp$family = colnames(fimo.scores.atac)[i]
    plot.df = rbind(plot.df,temp)
}

plot.df$genes = as.factor(plot.df$genes)
no.fam = rownames(fimo.scores.atac[which(rowSums(is.na(fimo.scores.atac)) == ncol(fimo.scores.atac)),])
temp = plot.df.atac[plot.df.atac$genes %in% no.fam,]
temp$family = 'No Family'
plot.df = rbind(plot.df,temp)

plot.df$status = 'Activated'
plot.df[plot.df$supercluster == 'down.up' | plot.df$supercluster == 'down.up.down',]$status = 'Repressed'

12.0.1 add status info

load("fimo.scores.atac.Rdata") # contains increase or decreased peaks
load("fimo.scores.nondyn.atac.Rdata") # contains unchanged peaks
load('fimo.scores.all.other.atac.Rdata') # contains all other peaks
load('plot.df.atac.Rdata') 

plot.df = data.frame()
for(i in 1:ncol(fimo.scores.atac)) {
    temp = plot.df.atac[plot.df.atac$genes %in% rownames(fimo.scores.atac[!is.na(fimo.scores.atac[,i]),]),]
    temp$family = colnames(fimo.scores.atac)[i]
    plot.df = rbind(plot.df,temp)
}

plot.df$genes = as.factor(plot.df$genes)
no.fam = rownames(fimo.scores.atac[which(rowSums(is.na(fimo.scores.atac)) == ncol(fimo.scores.atac)),])
temp = plot.df.atac[plot.df.atac$genes %in% no.fam,]
temp$family = 'No Family'

plot.df = rbind(plot.df,temp)

plot.df$status = 'Activated'
plot.df[plot.df$supercluster == 'down.up' | plot.df$supercluster == 'down.up.down',]$status = 'Repressed'

df1<-fimo.scores.nondyn.atac
df1$genes<-rownames(df1)
df1$chr = sapply(strsplit(df1$genes, ':|-'), '[', 1)
df1$start = sapply(strsplit(df1$genes, ':|-'), '[', 2)
df1$end = sapply(strsplit(df1$genes, ':|-'), '[', 3)
df1$status<-"unchanged"

df2<-fimo.scores.atac
df2$genes<-rownames(df2)
df2$chr = sapply(strsplit(df2$genes, ':|-'), '[', 1)
df2$start = sapply(strsplit(df2$genes, ':|-'), '[', 2)
df2$end = sapply(strsplit(df2$genes, ':|-'), '[', 3)
label<-unique(plot.df[,c(16,18)])
df2<-merge(df2,label,by="genes")
df2[df2$status=="Activated",]$status="Gain Accessibility"
df2[df2$status=="Repressed",]$status="Lose Accessibility"

df3<-fimo.scores.all.other.atac
df3$genes<-rownames(df3)
df3$chr = sapply(strsplit(df3$genes, ':|-'), '[', 1)
df3$start = sapply(strsplit(df3$genes, ':|-'), '[', 2)
df3$end = sapply(strsplit(df3$genes, ':|-'), '[', 3)
df3$status<-"all.other"

fimo_scores_allpeaks<-rbind(df1,df2,df3)
save(fimo_scores_allpeaks, file = 'fimo.scores.allpeaks.atac.Rdata')
load('fimo.scores.allpeaks.atac.Rdata')
df = data.frame(fimo_scores_allpeaks)

12.0.2 add counts

# add counts
load('normalized.counts.atac.Rdata')
zero.min = c()
fifteen.min = c()
thirty.min = c()
fourtyfive.min = c()

genes = c()

print('Getting ATAC means')
for (i in 1:nrow(normalized.counts)) {
    zero.min = append(zero.min, mean(normalized.counts[i,13:16]))
    fifteen.min = append(fifteen.min, mean(normalized.counts[i,1:4]))
    thirty.min = append(thirty.min, mean(normalized.counts[i,5:8]))
    fourtyfive.min = append(fourtyfive.min, mean(normalized.counts[i,9:12]))
    
    genes = append(genes,rownames(normalized.counts)[i])
}

atac.means = data.frame(row.names = genes, 
                        min.0 = zero.min, 
                        min.15 = fifteen.min,
                        min.30 = thirty.min, 
                        min.45 = fourtyfive.min)

save(atac.means,file='atac.means.Rdata')

atac.means$genes<-rownames(atac.means)
df = merge(df, atac.means, by='genes')

12.0.3 add cluster info

# add cluster information
# down.up: cluster 4 & 6
# down.up.down: cluster 1
# gradual.up: cluster 5
# up.down: cluster 3 & 7
# up.down.up: cluster 2

cluster.df = unique(plot.df.atac[,c(5,15,16)])

df = merge(df,cluster.df,by='genes', all=T)

#rownames(df) = df[,1]
#df = df[,-1]

atac.comprehensive =df
save(atac.comprehensive,file='atac.comprehensive.Rdata')

13 motif analysis visualization

13.0.1 prepare data

load("fimo.scores.atac.Rdata") # contains increase or decreased peaks
load("fimo.scores.nondyn.atac.Rdata") # contains unchanged peaks
load('fimo.scores.all.other.atac.Rdata') # contains all other peaks
load('plot.df.atac.Rdata') 

plot.df = data.frame()
for(i in 1:ncol(fimo.scores.atac)) {
    temp = plot.df.atac[plot.df.atac$genes %in% rownames(fimo.scores.atac[!is.na(fimo.scores.atac[,i]),]),]
    temp$family = colnames(fimo.scores.atac)[i]
    plot.df = rbind(plot.df,temp)
}

plot.df$genes = as.factor(plot.df$genes)
no.fam = rownames(fimo.scores.atac[which(rowSums(is.na(fimo.scores.atac)) == ncol(fimo.scores.atac)),])
temp = plot.df.atac[plot.df.atac$genes %in% no.fam,]
temp$family = 'No Family'

plot.df = rbind(plot.df,temp)

plot.df$status = 'Activated'
plot.df[plot.df$supercluster == 'down.up' | plot.df$supercluster == 'down.up.down',]$status = 'Repressed'
plot.df1=unique(plot.df[,c(5,15:18)])
plot.df2 = data.frame()
for(i in 1:ncol(fimo.scores.nondyn.atac)) {
    
     temp=fimo.scores.nondyn.atac[!is.na(fimo.scores.nondyn.atac[,i]),]
     temp$family=colnames(fimo.scores.nondyn.atac)[i]
     temp$genes=rownames(temp)
     plot.df2 = rbind(plot.df2,temp[,10:11])
}

plot.df2$genes = as.factor(plot.df2$genes)
no.fam = rownames(fimo.scores.nondyn.atac[which(rowSums(is.na(fimo.scores.nondyn.atac)) == ncol(fimo.scores.nondyn.atac)),])
temp = data.frame(no.fam)
colnames(temp)="genes"
temp$family = 'No Family'

plot.df2 = rbind(plot.df2,temp)

plot.df2$status = 'unchanged'
plot.df2$cluster = 'unchanged'
plot.df2$supercluster = 'unchanged'
plot.df.sum=rbind(plot.df1, plot.df2)
plot.df.sum[plot.df.sum$status=="Activated",]$status="Gain Accessibility"
plot.df.sum[plot.df.sum$status=="Repressed",]$status="Lose Accessibility"

raw counts

activated = table(plot.df.sum[plot.df.sum$status == 'Gain Accessibility',]$family) 
repressed = table(plot.df.sum[plot.df.sum$status == 'Lose Accessibility',]$family) 
unchanged = table(plot.df.sum[plot.df.sum$status == 'unchanged',]$family) 

df = data.frame(names = c(names(activated),names(repressed), names(unchanged)),
                num = c(as.vector(activated),as.vector(repressed), as.vector(unchanged)),
                cond = c(rep('Gain Accessibility',length(activated)),rep('Lose Accessibility',length(repressed)),rep('unchanged',length(unchanged))),
                index = rep(as.vector(table(plot.df$family)),3))
df[df$names == 'No Family',]$index = min(df$index) - 1

add fraction

df_sum<-df
df_sum$fraction=c(df_sum[df_sum$cond=="Gain Accessibility",]$num/length(unique(plot.df.sum[plot.df.sum$status=="Gain Accessibility",]$genes)), df_sum[df_sum$cond=="Lose Accessibility",]$num/length(unique(plot.df.sum[plot.df.sum$status=="Lose Accessibility",]$genes)), df_sum[df_sum$cond=="unchanged",]$num/length(unique(plot.df.sum[plot.df.sum$status=="unchanged",]$genes)))
df_sum$fraction<-(df_sum$fraction)*100

13.0.2 count the number/fraction of up, down, unchanged ATAC peaks with a genomic instance of the PSWM

  1. plot number of peaks in each motif family
pdf(file="percent of each motif family occurrance.pdf",width=8,height=5)
    print(
        ggplot(df_sum,aes(x = reorder(names,-num),y = fraction,fill=cond)) +
        geom_bar(stat='identity',position='dodge',color='black') + 
        labs(title = "% of Peaks For Each Motif Family",
             y = "% of Peaks",
             x = 'Motif Family',
             fill = 'Direction') +
        theme_minimal() +
        theme(panel.grid.minor = element_blank(),
              axis.ticks = element_blank(),
              axis.text.x = element_text(angle=45,size=20,hjust=.99,vjust=1,color='black',face='bold'),
              axis.text.y = element_text(size=20,face='bold',color='black'),
              axis.title.y = element_text(size=18,face='bold'),
              legend.text = element_text(size=18,face='bold'),
              plot.title = element_text(size=22,face='bold',hjust=0.5)) +
        scale_fill_manual(values = c("red","blue","gray"))
    )
dev.off() 

motif_occurance (2) plot number of peaks in each motif family (by cluster)

13.0.3 ChiSq test

fimo.scores.atac1<-fimo.scores.atac
fimo.scores.atac1$genes<-rownames(fimo.scores.atac1)

#increased peaks
Increased_peaks<-as.data.frame(unique(plot.df.sum[plot.df.sum$status=="Gain Accessibility",]$genes)) 
colnames(Increased_peaks)<-"genes"
Increased_peaks<-merge(fimo.scores.atac1, Increased_peaks, by="genes")

#decreased peaks
Decreased_peaks<-as.data.frame(unique(plot.df.sum[plot.df.sum$status=="Lose Accessibility",]$genes)) 
colnames(Decreased_peaks)<-"genes"
Decreased_peaks<-merge(fimo.scores.atac1, Decreased_peaks, by="genes")

#unchanged peaks
unchanged_peaks<-fimo.scores.nondyn.atac
unchanged_peaks$genes<-rownames(fimo.scores.nondyn.atac)
df_sum_withfam <-matrix(ncol=3, nrow=9)
for (i in 1:9){
  df_sum_withfam[i,1]<- as.numeric(table(is.na(unchanged_peaks[[i]]))[[1]])
  df_sum_withfam[i,2]<- as.numeric(table(is.na(Increased_peaks[[i+1]]))[[1]])
  df_sum_withfam[i,3]<- as.numeric(table(is.na(Decreased_peaks[[i+1]]))[[1]])
}
df_sum_withfam<-as.data.frame(df_sum_withfam)
rownames(df_sum_withfam)<-c(colnames(unchanged_peaks)[1:9])
colnames(df_sum_withfam)<-c("unchanged","Gain Accessibility", "Lose Accessibility")

#no CRE
df_sum_nofam <-matrix(ncol=3, nrow=9)
for (j in 1:9){
  df_sum_nofam[j,1]<-as.numeric(table(is.na(unchanged_peaks[[j]]))[[2]])
  df_sum_nofam[j,2] <-as.numeric(table(is.na(Increased_peaks[[j+1]]))[[2]])
  df_sum_nofam[j,3] <-as.numeric(table(is.na(Decreased_peaks[[j+1]]))[[2]])
}

df_sum_nofam<-as.data.frame(df_sum_nofam)
rownames(df_sum_nofam)<-c("no CRE", "no ETV1","no Egr1","no Fosl2","no KLF13","no Klf4","no NFYB", "no RREB1", "no Ronin")
colnames(df_sum_nofam)<-c("unchanged","Gain Accessibility", "Lose Accessibility")

df_sum2<-rbind(df_sum_nofam, df_sum_withfam)
chidata = data.frame(matrix(ncol = 5, nrow = 9))
for (i in 1:9){
  chidata$fam_vs_no_spe_fam[i]<-as.character(rownames(df_sum2)[i+9])
  #chidata$Gain_X_squared[i]<-chisq.test(df_sum2[c(i,i+9),1:2], correct=F)$statistic
  chidata$Gain_p_value[i]<-chisq.test(df_sum2[c(i,i+9),1:2], correct=F)$p.value
  #chidata$lose_X_squared[i]<-chisq.test(df_sum2[c(i,i+9),1:3], correct=F)$statistic
  chidata$lose_p_value[i]<-chisq.test(df_sum2[c(i,i+9),1:3], correct=F)$p.value
  #chidata$lose_Gain_X_squared[i]<-chisq.test(df_sum2[c(i,i+9),2:3], correct=F)$statistic
  chidata$lose_Gain_p_value[i]<-chisq.test(df_sum2[c(i,i+9),2:3], correct=F)$p.value
  
}
chidata_p<-chidata[order(chidata$lose_Gain_p_value),][,6:9]

chisq_test

13.0.4 bar plot

df_sum2_fraction<-df_sum2
df_sum2_fraction$`Gain Accessibility`<-100*df_sum2_fraction$`Gain Accessibility`/nrow(Increased_peaks)
df_sum2_fraction$unchanged<-100*df_sum2_fraction$unchanged/nrow(unchanged_peaks)
df_sum2_fraction$`Lose Accessibility`<-100*df_sum2_fraction$`Lose Accessibility`/nrow(Decreased_peaks)

library("reshape")
df_sum2_fraction_mut<-df_sum2_fraction
df_sum2_fraction_mut$family<-rownames(df_sum2_fraction)
df_sum2_fraction_mut<-melt(df_sum2_fraction_mut, 
                           id.vars="family",
                           variable.name="class",
                           value.name="fraction")

df_sum2_fraction_mut$variable <- factor(df_sum2_fraction_mut$variable, levels = c('Gain Accessibility', 'unchanged', 'Lose Accessibility'))
library(ggplot2)

plot_cre<-ggplot(df_sum2_fraction_mut[c(1,10,19,28,37,46),],aes(x =variable ,y = value, fill=family)) +
        geom_bar(stat='identity',position='stack',color='black') +
        
        labs(title = 'Proportions of peaks with or without CRE', y = '% of Peaks', x = NULL, fill = NULL) +
        theme_minimal() +
        theme(panel.grid.minor = element_blank(),
              axis.ticks = element_blank()) +
        scale_fill_manual(values = c("red","blue")) +
        theme(axis.text.x=element_text(angle=45, hjust=0.9))

plot_etv1<-ggplot(df_sum2_fraction_mut[c(2,11,20,29,38,47),],aes(x =variable ,y = value, fill=family)) +
        geom_bar(stat='identity',position='stack',color='black') +
        
        labs(title = 'Proportions of peaks with or without ETV1', y = '% of Peaks', x = NULL, fill = NULL) +
        theme_minimal() +
        theme(panel.grid.minor = element_blank(),
              axis.ticks = element_blank()) +
        scale_fill_manual(values = c("red","blue")) +
  theme(axis.text.x=element_text(angle=45, hjust=0.9))


plot_Egr1<-ggplot(df_sum2_fraction_mut[c(3,12,21,30,39,48),],aes(x =variable ,y = value, fill=family)) +
        geom_bar(stat='identity',position='stack',color='black') +
        
        labs(title = 'Proportions of peaks with or without Egr1', y = '% of Peaks', x = NULL, fill = NULL) +
        theme_minimal() +
        theme(panel.grid.minor = element_blank(),
              axis.ticks = element_blank()) +
        scale_fill_manual(values = c("red","blue")) +
  theme(axis.text.x=element_text(angle=45, hjust=0.9))

plot_Fosl2<-ggplot(df_sum2_fraction_mut[c(4,13,22,31,40,49),],aes(x =variable ,y = value, fill=family)) +
        geom_bar(stat='identity',position='stack',color='black') +
        
        labs(title = 'Proportions of peaks with or without Fosl2 ', y = '% of Peaks', x = NULL, fill = NULL) +
        theme_minimal() +
        theme(panel.grid.minor = element_blank(),
              axis.ticks = element_blank()) +
        scale_fill_manual(values = c("red","blue")) +
  theme(axis.text.x=element_text(angle=45, hjust=0.9))


plot_KLF13<-ggplot(df_sum2_fraction_mut[c(5,14,23,32,41,50),],aes(x =variable ,y = value, fill=family)) +
        geom_bar(stat='identity',position='stack',color='black') +
        
        labs(title = 'Proportions of peaks with or without KLF13', y = '% of Peaks', x = NULL, fill = NULL) +
        theme_minimal() +
        theme(panel.grid.minor = element_blank(),
              axis.ticks = element_blank()) +
        scale_fill_manual(values = c("red","blue")) +
  theme(axis.text.x=element_text(angle=45, hjust=0.9))

plot_Klf4<-ggplot(df_sum2_fraction_mut[c(6,15,24,33,42,51),],aes(x =variable ,y = value, fill=family)) +
        geom_bar(stat='identity',position='stack',color='black') +
        
        labs(title = 'Proportions of peaks with or without Klf4', y = '% of Peaks', x = NULL, fill = NULL) +
        theme_minimal() +
        theme(panel.grid.minor = element_blank(),
              axis.ticks = element_blank()) +
        scale_fill_manual(values = c("red","blue")) +
  theme(axis.text.x=element_text(angle=45, hjust=0.9))

plot_NFYB<-ggplot(df_sum2_fraction_mut[c(7,16,25,34,43,52),],aes(x =variable ,y = value, fill=family)) +
        geom_bar(stat='identity',position='stack',color='black') +
        
        labs(title = 'Proportions of peaks with or without NFYB', y = '% of Peaks', x = NULL, fill = NULL) +
        theme_minimal() +
        theme(panel.grid.minor = element_blank(),
              axis.ticks = element_blank()) +
        scale_fill_manual(values = c("red","blue")) +
  theme(axis.text.x=element_text(angle=45, hjust=0.9))

plot_RREB1<-ggplot(df_sum2_fraction_mut[c(8,17,26,35,44,53),],aes(x =variable ,y = value, fill=family)) +
        geom_bar(stat='identity',position='stack',color='black') +
        
        labs(title = 'Proportions of peaks with or without RREB1', y = '% of Peaks', x = NULL, fill = NULL) +
        theme_minimal() +
        theme(panel.grid.minor = element_blank(),
              axis.ticks = element_blank()) +
        scale_fill_manual(values = c("blue","red")) +
  theme(axis.text.x=element_text(angle=45, hjust=0.9))

plot_Ronin<-ggplot(df_sum2_fraction_mut[c(9,18,27,36,45,54),],aes(x =variable ,y = value, fill=family)) +
        geom_bar(stat='identity',position='stack',color='black') +
        
        labs(title = 'Proportions of peaks with or without Ronin', y = '% of Peaks', x = NULL, fill = NULL) +
        theme_minimal() +
        theme(panel.grid.minor = element_blank(),
              axis.ticks = element_blank()) +
        scale_fill_manual(values = c("blue","red")) +
  theme(axis.text.x=element_text(angle=45, hjust=0.9))
#library(ggplot2)
library(patchwork)

pdf(file='Proportions of peaks with or without each family motif.pdf', width=8,height=5)

(plot_cre | plot_etv1)/
  (plot_Egr1 | plot_Fosl2)

  ( plot_KLF13 | plot_Klf4)/ 
(plot_NFYB |plot_RREB1 )

  plot_Ronin

dev.off()

peak_proportion

13.0.5 bar plot by cluster

fimo.scores.atac1<-fimo.scores.atac
fimo.scores.atac1$genes<-rownames(fimo.scores.atac1)

# down.up: cluster 4 & 6
down.up_peaks<-as.data.frame(unique(plot.df.sum[plot.df.sum$supercluster=="down.up",]$genes)) 
colnames(down.up_peaks)<-"genes"
down.up_peaks<-merge(fimo.scores.atac1, down.up_peaks, by="genes")

# down.up.down: cluster 1
down.up.down_peaks<-as.data.frame(unique(plot.df.sum[plot.df.sum$supercluster=="down.up.down",]$genes)) 
colnames(down.up.down_peaks)<-"genes"
down.up.down_peaks<-merge(fimo.scores.atac1, down.up.down_peaks, by="genes")

# gradual.up: cluster 5
gradual.up_peaks<-as.data.frame(unique(plot.df.sum[plot.df.sum$supercluster=="gradual.up",]$genes)) 
colnames(gradual.up_peaks)<-"genes"
gradual.up_peaks<-merge(fimo.scores.atac1, gradual.up_peaks, by="genes")

# up.down: cluster 3 & 7
up.down_peaks<-as.data.frame(unique(plot.df.sum[plot.df.sum$supercluster=="up.down",]$genes)) 
colnames(up.down_peaks)<-"genes"
up.down_peaks<-merge(fimo.scores.atac1, up.down_peaks, by="genes")

# up.down.up: cluster 2
up.down.up_peaks<-as.data.frame(unique(plot.df.sum[plot.df.sum$supercluster=="up.down.up",]$genes)) 
colnames(up.down.up_peaks)<-"genes"
up.down.up_peaks<-merge(fimo.scores.atac1, up.down.up_peaks, by="genes")

#unchanged peaks
unchanged_peaks<-fimo.scores.nondyn.atac
unchanged_peaks$genes<-rownames(fimo.scores.nondyn.atac)
df_sum_withfam <-matrix(ncol=6, nrow=9)
for (i in 1:9){
  df_sum_withfam[i,1]<- as.numeric(table(is.na(unchanged_peaks[[i]]))[[1]])
  df_sum_withfam[i,2]<- as.numeric(table(is.na(down.up_peaks[[i+1]]))[[1]])
  df_sum_withfam[i,3]<- as.numeric(table(is.na(down.up.down_peaks[[i+1]]))[[1]])
  df_sum_withfam[i,4]<- as.numeric(table(is.na(gradual.up_peaks[[i+1]]))[[1]])
  df_sum_withfam[i,5]<- as.numeric(table(is.na(up.down_peaks[[i+1]]))[[1]])
  df_sum_withfam[i,6]<- as.numeric(table(is.na(up.down.up_peaks[[i+1]]))[[1]])
 
}
df_sum_withfam<-as.data.frame(df_sum_withfam)
rownames(df_sum_withfam)<-c(colnames(unchanged_peaks)[1:9])
colnames(df_sum_withfam)<-c("unchanged","down.up", "down.up.down","gradual.up","up.down","up.down.up")

#no CRE
df_sum_nofam <-matrix(ncol=6, nrow=9)
for (j in 1:9){
  df_sum_nofam[j,1]<- as.numeric(table(is.na(unchanged_peaks[[j]]))[[2]])
  df_sum_nofam[j,2]<- as.numeric(table(is.na(down.up_peaks[[j+1]]))[[2]])
  df_sum_nofam[j,3]<- as.numeric(table(is.na(down.up.down_peaks[[j+1]]))[[2]])
  df_sum_nofam[j,4]<- as.numeric(table(is.na(gradual.up_peaks[[j+1]]))[[2]])
  df_sum_nofam[j,5]<- as.numeric(table(is.na(up.down_peaks[[j+1]]))[[2]])
  df_sum_nofam[j,6]<- as.numeric(table(is.na(up.down.up_peaks[[j+1]]))[[2]])
}

df_sum_nofam<-as.data.frame(df_sum_nofam)
rownames(df_sum_nofam)<-c("no CRE", "no ETV1","no Egr1","no Fosl2","no KLF13","no Klf4","no NFYB", "no RREB1", "no Ronin")
colnames(df_sum_nofam)<-c("unchanged","down.up", "down.up.down","gradual.up","up.down","up.down.up")
df_sum2<-rbind(df_sum_nofam, df_sum_withfam)

df_sum2_fraction<-df_sum2
df_sum2_fraction$unchanged<-100*df_sum2_fraction$unchanged/nrow(unchanged_peaks)
df_sum2_fraction$down.up<-100*df_sum2_fraction$down.up/nrow(down.up_peaks)
df_sum2_fraction$down.up.down<-100*df_sum2_fraction$down.up.down/nrow(down.up.down_peaks)
df_sum2_fraction$gradual.up<-100*df_sum2_fraction$gradual.up/nrow(gradual.up_peaks)
df_sum2_fraction$up.down<-100*df_sum2_fraction$up.down/nrow(up.down_peaks)
df_sum2_fraction$up.down.up<-100*df_sum2_fraction$up.down.up/nrow(up.down.up_peaks)


library("reshape")
df_sum2_fraction_mut<-df_sum2_fraction
df_sum2_fraction_mut$family<-rownames(df_sum2_fraction)
df_sum2_fraction_mut<-melt(df_sum2_fraction_mut, 
                           id.vars="family",
                           variable.name="class",
                           value.name="fraction")

df_sum2_fraction_mut$variable <- factor(df_sum2_fraction_mut$variable, levels = c("gradual.up","up.down","up.down.up","unchanged", "down.up", "down.up.down" ))
library(ggplot2)

num=c(1,10,19,28,37,46,55,64,82,73,91,100)

plot_cre<-ggplot(df_sum2_fraction_mut[num,],aes(x =variable ,y = value, fill=family)) +
        geom_bar(stat='identity',position='stack',color='black') +
        
        labs(title = 'Proportions of peaks with or without CRE', y = '% of Peaks', x = NULL, fill = NULL) +
        theme_minimal() +
        theme(panel.grid.minor = element_blank(),
              axis.ticks = element_blank()) +
        scale_fill_manual(values = c("red","blue")) +
        theme(axis.text.x=element_text(angle=45, hjust=0.9))

plot_etv1<-ggplot(df_sum2_fraction_mut[num+1,],aes(x =variable ,y = value, fill=family)) +
        geom_bar(stat='identity',position='stack',color='black') +
        
        labs(title = 'Proportions of peaks with or without ETV1', y = '% of Peaks', x = NULL, fill = NULL) +
        theme_minimal() +
        theme(panel.grid.minor = element_blank(),
              axis.ticks = element_blank()) +
        scale_fill_manual(values = c("red","blue")) +
  theme(axis.text.x=element_text(angle=45, hjust=0.9))


plot_Egr1<-ggplot(df_sum2_fraction_mut[num+2,],aes(x =variable ,y = value, fill=family)) +
        geom_bar(stat='identity',position='stack',color='black') +
        
        labs(title = 'Proportions of peaks with or without Egr1', y = '% of Peaks', x = NULL, fill = NULL) +
        theme_minimal() +
        theme(panel.grid.minor = element_blank(),
              axis.ticks = element_blank()) +
        scale_fill_manual(values = c("red","blue")) +
  theme(axis.text.x=element_text(angle=45, hjust=0.9))

plot_Fosl2<-ggplot(df_sum2_fraction_mut[num+3,],aes(x =variable ,y = value, fill=family)) +
        geom_bar(stat='identity',position='stack',color='black') +
        
        labs(title = 'Proportions of peaks with or without Fosl2', y = '% of Peaks', x = NULL, fill = NULL) +
        theme_minimal() +
        theme(panel.grid.minor = element_blank(),
              axis.ticks = element_blank()) +
        scale_fill_manual(values = c("red","blue")) +
  theme(axis.text.x=element_text(angle=45, hjust=0.9))


plot_KLF13<-ggplot(df_sum2_fraction_mut[num+4,],aes(x =variable ,y = value, fill=family)) +
        geom_bar(stat='identity',position='stack',color='black') +
        
        labs(title = 'Proportions of peaks with or without KLF13', y = '% of Peaks', x = NULL, fill = NULL) +
        theme_minimal() +
        theme(panel.grid.minor = element_blank(),
              axis.ticks = element_blank()) +
        scale_fill_manual(values = c("red","blue")) +
  theme(axis.text.x=element_text(angle=45, hjust=0.9))

plot_Klf4<-ggplot(df_sum2_fraction_mut[num+5,],aes(x =variable ,y = value, fill=family)) +
        geom_bar(stat='identity',position='stack',color='black') +
        
        labs(title = 'Proportions of peaks with or without Klf4', y = '% of Peaks', x = NULL, fill = NULL) +
        theme_minimal() +
        theme(panel.grid.minor = element_blank(),
              axis.ticks = element_blank()) +
        scale_fill_manual(values = c("red","blue")) +
  theme(axis.text.x=element_text(angle=45, hjust=0.9))

plot_NFYB<-ggplot(df_sum2_fraction_mut[num+6,],aes(x =variable ,y = value, fill=family)) +
        geom_bar(stat='identity',position='stack',color='black') +
        
        labs(title = 'Proportions of peaks with or without NFYB', y = '% of Peaks', x = NULL, fill = NULL) +
        theme_minimal() +
        theme(panel.grid.minor = element_blank(),
              axis.ticks = element_blank()) +
        scale_fill_manual(values = c("red","blue")) +
  theme(axis.text.x=element_text(angle=45, hjust=0.9))

plot_RREB1<-ggplot(df_sum2_fraction_mut[num+7,],aes(x =variable ,y = value, fill=family)) +
        geom_bar(stat='identity',position='stack',color='black') +
        
        labs(title = 'Proportions of peaks with or without RREB1', y = '% of Peaks', x = NULL, fill = NULL) +
        theme_minimal() +
        theme(panel.grid.minor = element_blank(),
              axis.ticks = element_blank()) +
        scale_fill_manual(values = c("blue","red")) +
  theme(axis.text.x=element_text(angle=45, hjust=0.9))

plot_Ronin<-ggplot(df_sum2_fraction_mut[num+8,],aes(x =variable ,y = value, fill=family)) +
        geom_bar(stat='identity',position='stack',color='black') +
        
        labs(title = 'Proportions of peaks with or without Ronin', y = '% of Peaks', x = NULL, fill = NULL) +
        theme_minimal() +
        theme(panel.grid.minor = element_blank(),
              axis.ticks = element_blank()) +
        scale_fill_manual(values = c("blue","red")) +
  theme(axis.text.x=element_text(angle=45, hjust=0.9))
#library(ggplot2)
library(patchwork)

pdf(file='Proportions of peaks with or without each family motif by cluster.pdf', width=8,height=5)

(plot_cre | plot_etv1)/
  (plot_Egr1 | plot_Fosl2)

  ( plot_KLF13 | plot_Klf4)/ 
(plot_NFYB |plot_RREB1 )

  plot_Ronin

dev.off()

peak_proportion

13.0.6 FIMO scores violin plot box and whisker plots for isolated peaks

plot.df = data.frame()

for(i in 1:ncol(fimo.scores.atac)) {
      scores.temp = fimo.scores.atac[!is.na(fimo.scores.atac[,i]),]
      scores.temp = scores.temp[,-i]
      scores.temp = scores.temp[which(rowSums(is.na(scores.temp)) == ncol(scores.temp)),]
      temp = plot.df.atac[plot.df.atac$genes %in% rownames(scores.temp),]
      temp$family = colnames(fimo.scores.atac)[i]
      plot.df = rbind(plot.df,temp)
}

plot.df$status = 'Activated'
plot.df[plot.df$supercluster == 'down.up.down' | plot.df$supercluster == 'down.up',]$status = 'Repressed'
plot.df = unique(plot.df[,16:18])

func <- function(peak) {
    family = plot.df[plot.df$genes == peak,]$family    
    colnum = which(colnames(fimo.scores.atac) == family)
    score = fimo.scores.atac[rownames(fimo.scores.atac) == peak,colnum]
    return(score)
}

plot.df$score = sapply(plot.df$genes,func)
pdf(file = paste0('fimo.scores.isolated.peaks.violin.pdf'),width=12,height=8)

trellis.par.set(box.umbrella = list(lty = 1, col=c("red", "blue"), lwd=2),
                box.rectangle = list( lwd=2.0, col=c("red", "blue"), alpha = 1.0),
                plot.symbol = list(col=c("red", "blue"), lwd=2.0, pch ='.'))

print(
    bwplot(log(as.numeric(as.character(score)), base = 10) ~ status | family, data = plot.df, 
           horizontal = FALSE, pch = '|', do.out = FALSE,
           scales=list(x=list(cex=1.0, relation = "free", rot = 45), y = list(cex=1.0, relation="free")),
           aspect=2.0,
           between=list(y=0.5, x=0.5),
           index.cond=list(c(1,4,6,7,2,3,5,8,9)),
           ylab = expression('log'[10]* paste('(motif scores)')),
           xlab = expression('ATAC Peak category'),
                                        #manually setting to avoid outlier, since do.out = FALSE                
           panel = function(..., box.ratio) {
        panel.violin(..., col = "transparent", border = "grey60", varwidth = FALSE, 
            box.ratio = box.ratio)
        panel.bwplot(..., fill = NULL, box.ratio = 0.1)
    },
           #ylim = list(c(0.9, 1.9),  c(0.9, 1.7), c(0.8, 1.8), c(1, 2.15),  c(0.96, 1.4)),
           par.settings = list(superpose.symbol = list(pch = c(16),col=c('grey20'), cex =0.5),
                               #I want to change the background strip to the corresponding motif color
                               strip.background=list(col=c("grey80"))))
    
)
dev.off()

violin_plot

13.0.7 motif density plot

13.1 by unchanged, gain, lose

Generate composites for all activated and repressed peaks

pswm.fimo <- function(fimo.in, out = 'outfilename', nm = 'bHLH_Activated', rc = FALSE) {
    posnum = nchar(fimo.in$sequence[1])
    fimo.in$sequence = toupper(fimo.in$sequence)
    col.matrix = matrix()
    for (g in 1:posnum){
        itnum = lapply(strsplit(as.character(fimo.in$sequence), ''), "[", g)
        if (g == 1) {
            col.matrix = itnum
        } else {
            col.matrix = cbind(col.matrix, itnum)
        }
    }  
  a.nuc = sapply(1:posnum, function(x) sum(col.matrix[,x] == "A"))
  t.nuc = sapply(1:posnum, function(x) sum(col.matrix[,x] == "T"))
  c.nuc = sapply(1:posnum, function(x) sum(col.matrix[,x] == "C"))
  g.nuc = sapply(1:posnum, function(x) sum(col.matrix[,x] == "G"))
  
  pswm = cbind(a.nuc, c.nuc, g.nuc, t.nuc)
  print(pswm)
  outfile = file(paste0(out, '.txt'))
  on.exit(close(outfile))
  writeLines(c("MEME version 4", "ALPHABET= ACGT", "strands: + -", " ", 
               "Background letter frequencies (from uniform background):", 
               "A 0.30000 C 0.20000 G 0.20000 T 0.30000", paste("MOTIF", out), " ",
               paste("letter-probability matrix: alength= 4 w=", posnum)), outfile)
  pswm = pswm/rowSums(pswm)
  if (rc == "TRUE"){
    pswm<- pswm[nrow(pswm):1,ncol(pswm):1]
  } else {}   
    write.table(pswm, file = paste0(out, '.txt'), append = TRUE, 
                quote=FALSE, row.names =FALSE, col.names = FALSE)
    pswm = t(pswm)
    rownames(pswm) = c('A', 'C', 'G', 'T')
    return(pswm)
    #    
#    system(paste0('/Users/guertinlab/meme/libexec/meme-5.1.1/ceqlogo -i ', out, '.txt -m Composite -o ', nm, '.eps'))
#    system(paste0('/Users/guertinlab/meme/libexec/meme-5.1.1/ceqlogo -i ', out, '.txt -m Composite -o ', nm, '.rc.eps -r'))
}
plot.df = plot.df.atac
plot.df$status = 'Activated'
plot.df[plot.df$supercluster == 'down.up.down' | plot.df$supercluster == 'down.up',]$status = 'Repressed'
count = -1
vec.names = c()

df.seq = as.data.frame(matrix(ncol=5, nrow=0),stringsAsFactors = FALSE)
colnames(df.seq) = c('chr', 'start', 'end', 'sequence', 'factor')

for(bed.file in Sys.glob(file.path('*_fimo.bed'))) {
    #print(bed.file)
    count = count + 1
    factor.name = strsplit(bed.file, "/")[[1]]
    factor.name = strsplit(factor.name[length(factor.name)],
                           '_fimo.bed')[[1]][1]
    print(factor.name)
    x = read.table(bed.file, stringsAsFactors=FALSE)
    x = x[x[,6] != -1,]
    x = x[,c(1,2,3,10)]
    x$factor = factor.name
    colnames(x) = c('chr', 'start', 'end', 'sequence', 'factor')
    x$re = paste0(x[,1], ':', x[,2], '-', x[,3])
    df.seq = rbind(df.seq, x) 
}
func <- function(peak) {
     return(unique(plot.df[plot.df$genes == peak,]$status))
}

df.seq$status = sapply(df.seq$re,func)

for(i in unique(df.seq$factor)) {
      act = df.seq[df.seq$factor == i & df.seq$status == 'Activated',]
      rep = df.seq[df.seq$factor == i & df.seq$status == 'Repressed',]
      pswm.fimo(act, out = paste0(i,'_activated'))
      pswm.fimo(rep, out = paste0(i,'_repressed'))
}

Generating bigWig for motif enrichment plot

on xanadu:

module load UCSC-tools/3.7.4 gcc/9.2.0 bedtools/2.29.2
#Unable to locate a modulefile for 'ucsc-tools'
cd fimo_top_database_PSWM_top/
for bed in *2M.bed
do
    name=$(echo $bed | awk -F"/" '{print $NF}' | awk -F"_2M.bed" '{print $1}')
    echo $name
    #summing scores of motifs w/in overlapping genomic interval
    cat $bed | mergeBed -i stdin -c 4 -o sum > ${name}_merged_2M.bed
    bedGraphToBigWig ${name}_merged_2M.bed ../../../mm39.chrom.sizes ${name}_mm39_instances.bigWig
    rm ${name}_merged_2M.bed
done
module load R/4.1.2
library(lattice)
library(bigWig)

plot.fimo.lattice <- function(dat, fact = 'Motif', summit = 'Hypersensitivity Summit', class= '',
                              num.m = -200, num.p =90, y.low =0, y.high = 0.2,
                              col.lines = c(rgb(0,0,1,1/2), rgb(1,0,0,1/2),
                                            rgb(0.1,0.5,0.05,1/2), rgb(0,0,0,1/2),
                                            rgb(1/2,0,1/2,1/2), rgb(0,1/2,1/2,1/2), rgb(1/2,1/2,0,1/2)),
                              fill.poly = c(rgb(0,0,1,1/4), rgb(1,0,0,1/4), rgb(0.1,0.5,0.05,1/4),
                                            rgb(0,0,0,1/4), rgb(1/2,0,1/2,1/4))) {
    
    pdf('motif_enrichment_around_summits.pdf')#, width=6.83, height=3.5)
    print(xyplot(density ~ range|tf, groups = category, data = dat, type = 'l',
                 scales=list(x=list(cex=0.8,relation = "free"), 
                 y =list(cex=0.8,axs = 'i',relation = "free")),
                 xlim=c(num.m,num.p),                                       
                 col = col.lines,
                 auto.key = list(points=F, lines=T, cex=0.8, columns = 2),
                 par.settings = list(superpose.symbol = list(pch = c(16), col=col.lines, cex =0.7),
                                     superpose.line = list(col = col.lines, lwd=c(2,2,2,2,2,2),
                                                           lty = c(1,1,1,1,1,1,1,1,1))),
                 cex.axis=1.0,
                 par.strip.text=list(cex=0.9, font=1, col='black',font=2),
                 aspect=1.0,
                 between=list(y=0.5, x=0.5),
                 index.cond = list(c(1,4,6,7,2,3,5,8,9)),
                 lwd=2,
                 ylab = list(label = "Weighted Motif Density", cex =1,font=2),
                 xlab = list(label = 'Distance from ATAC-seq Peak Summit', cex =1,font=2),
                 strip = function(..., which.panel, bg) {
                     bg.col = 'grey'#c("blue","grey65","red")
                     strip.default(..., which.panel = which.panel, 
                                    bg = rep(bg.col, length = which.panel)[which.panel])
                 }
                 ))
    dev.off()
}

bed.window <- function(bed, half.window) {
  bed[,2] = (bed[,2]+bed[,3])/2-half.window 
  bed[,3] = bed[,2] + 2 * half.window 
  return(bed)
  }

load('/home/FCAM/ssun/motifanalysis/meme_motif_enrichment2/csite_minsite/plot.df.atac.Rdata')
load('/home/FCAM/ssun/motifanalysis/meme_motif_enrichment2/csite_minsite/fimo.scores.atac.Rdata')

plot.df = data.frame()
for(i in 1:ncol(fimo.scores.atac)) {
    temp = plot.df.atac[plot.df.atac$genes %in% rownames(fimo.scores.atac[!is.na(fimo.scores.atac[,i]),]),]
    temp$family = colnames(fimo.scores.atac)[i]
    plot.df = rbind(plot.df,temp)
}

plot.df$status = 'Activated'
plot.df[plot.df$supercluster == 'down.up.down' | plot.df$supercluster == 'down.up',]$status = 'Repressed'
all.fimo = data.frame(matrix(ncol = 4, nrow = 0))
colnames(all.fimo) = c('density', 'tf', 'category', 'range')

half.win = 600
file.suffix = '_mm39_instances.bigWig'
dir = '/home/FCAM/ssun/motifanalysis/meme_motif_enrichment2/csite_minsite/fimo_top_database_PSWM_top'
setwd(dir)

decreased = plot.df[plot.df$status == 'Repressed',12:14]
decreased[,2] = as.numeric(decreased[,2])
decreased[,3] = as.numeric(decreased[,3])
decreased = bed.window(decreased,half.win)

increased = plot.df[plot.df$status == 'Activated',12:14]
increased[,2] = as.numeric(increased[,2])
increased[,3] = as.numeric(increased[,3])
increased = bed.window(increased,half.win)

not.different = read.table('/home/FCAM/ssun/motifanalysis/meme_motif_enrichment2/csite_minsite/nondynamic_peaks.bed')
not.different = not.different[not.different$V1 != 'chrM',]
not.different = bed.window(not.different,half.win)

all.fimo = data.frame()

for(i in 1:ncol(fimo.scores.atac)) {
    factor = colnames(fimo.scores.atac)[i]

    mod.bigWig = paste0(factor,file.suffix)
    factor.name = factor
    print(factor.name)

    loaded.bw = load.bigWig(mod.bigWig)
    
    dec.inten = bed.step.probeQuery.bigWig(loaded.bw, decreased,
                                           gap.value = 0, step = 10, as.matrix = TRUE)
    dec.query.df = data.frame(cbind(colMeans(dec.inten), factor.name,
                                    'Lose', seq(-half.win, (half.win-10), 10)), stringsAsFactors=F)
    colnames(dec.query.df) = c('density', 'tf', 'category', 'range')
    
    inc.inten = bed.step.probeQuery.bigWig(loaded.bw, increased,
                                           gap.value = 0, step = 10, as.matrix = TRUE)
    inc.query.df = data.frame(cbind(colMeans(inc.inten), factor.name,
                                    'Gain', seq(-half.win,(half.win-10), 10)), stringsAsFactors=F)
    colnames(inc.query.df) = c('density', 'tf', 'category', 'range')
    
    ctrl.inten = bed.step.probeQuery.bigWig(loaded.bw, not.different,
                                            gap.value = 0, step = 10, as.matrix = TRUE)
    ctrl.query.df = data.frame(cbind(colMeans(ctrl.inten), factor.name,
                                     'unchanged', seq(-half.win, (half.win-10), 10)), stringsAsFactors=F)
    colnames(ctrl.query.df) = c('density', 'tf', 'category', 'range')
    
    tf.all = rbind(dec.query.df, inc.query.df, ctrl.query.df)

    all.fimo = rbind(all.fimo,tf.all)
}

all.fimo[,1] = as.numeric(all.fimo[,1])
all.fimo[,4] = as.numeric(all.fimo[,4])

plot.fimo.lattice(all.fimo, num.m = -500, num.p = 500,
                  col.lines = c('red','blue','grey65'))

motif_density ## by supercluster module load R/4.1.2

library(lattice)
library(bigWig)

plot.fimo.lattice <- function(dat, fact = 'Motif', summit = 'Hypersensitivity Summit', class= '',
                              num.m = -200, num.p =90, y.low =0, y.high = 0.2,
                              col.lines = c(rgb(0,0,1,1/2), rgb(1,0,0,1/2),
                                            rgb(0.1,0.5,0.05,1/2), rgb(0,0,0,1/2),
                                            rgb(1/2,0,1/2,1/2), rgb(0,1/2,1/2,1/2), rgb(1/2,1/2,0,1/2)),
                              fill.poly = c(rgb(0,0,1,1/4), rgb(1,0,0,1/4), rgb(0.1,0.5,0.05,1/4),
                                            rgb(0,0,0,1/4), rgb(1/2,0,1/2,1/4))) {
    
    pdf('motif_enrichment_around_summits_by_cluster.pdf')#, width=6.83, height=3.5)
    print(xyplot(density ~ range|tf, groups = category, data = dat, type = 'l',
                 scales=list(x=list(cex=0.8,relation = "free"), 
                 y =list(cex=0.8,axs = 'i',relation = "free")),
                 xlim=c(num.m,num.p),                                       
                 col = col.lines,
                 auto.key = list(points=F, lines=T, cex=0.8, columns = 2),
                 par.settings = list(superpose.symbol = list(pch = c(16), col=col.lines, cex =0.7),
                                     superpose.line = list(col = col.lines, lwd=c(2,2,2,2,2,2),
                                                           lty = c(1,1,1,1,1,1,1,1,1))),
                 cex.axis=1.0,
                 par.strip.text=list(cex=0.9, font=1, col='black',font=2),
                 aspect=1.0,
                 between=list(y=0.5, x=0.5),
                 index.cond = list(c(1,4,6,7,2,3,5,8,9)),
                 lwd=1,
                 ylab = list(label = "Weighted Motif Density", cex =1,font=2),
                 xlab = list(label = 'Distance from ATAC-seq Peak Summit', cex =1,font=2),
                 strip = function(..., which.panel, bg) {
                     bg.col = 'grey'
                     strip.default(..., which.panel = which.panel, 
                                    bg = rep(bg.col, length = which.panel)[which.panel])
                 }
                 ))
    dev.off()
}

bed.window <- function(bed, half.window) {
  bed[,2] = (bed[,2]+bed[,3])/2-half.window 
  bed[,3] = bed[,2] + 2 * half.window 
  return(bed)
  }

load('/home/FCAM/ssun/motifanalysis/meme_motif_enrichment2/csite_minsite/plot.df.atac.Rdata')
load('/home/FCAM/ssun/motifanalysis/meme_motif_enrichment2/csite_minsite/fimo.scores.atac.Rdata')

plot.df = data.frame()
for(i in 1:ncol(fimo.scores.atac)) {
    temp = plot.df.atac[plot.df.atac$genes %in% rownames(fimo.scores.atac[!is.na(fimo.scores.atac[,i]),]),]
    temp$family = colnames(fimo.scores.atac)[i]
    plot.df = rbind(plot.df,temp)
}

plot.df$status = 'Activated'
plot.df[plot.df$supercluster == 'down.up.down' | plot.df$supercluster == 'down.up',]$status = 'Repressed'
all.fimo = data.frame(matrix(ncol = 4, nrow = 0))
colnames(all.fimo) = c('density', 'tf', 'category', 'range')

half.win = 600
file.suffix = '_mm39_instances.bigWig'
dir = '/home/FCAM/ssun/motifanalysis/meme_motif_enrichment2/csite_minsite/fimo_top_database_PSWM_top'
setwd(dir)

gradual.up = plot.df[plot.df$supercluster == 'gradual.up',12:14]
gradual.up[,2] = as.numeric(gradual.up[,2])
gradual.up[,3] = as.numeric(gradual.up[,3])
gradual.up = bed.window(gradual.up,half.win)

up.down = plot.df[plot.df$supercluster == 'up.down',12:14]
up.down[,2] = as.numeric(up.down[,2])
up.down[,3] = as.numeric(up.down[,3])
up.down = bed.window(up.down,half.win)

up.down.up = plot.df[plot.df$supercluster == 'up.down.up',12:14]
up.down.up[,2] = as.numeric(up.down.up[,2])
up.down.up[,3] = as.numeric(up.down.up[,3])
up.down.up = bed.window(up.down.up,half.win)

down.up = plot.df[plot.df$supercluster == 'down.up',12:14]
down.up[,2] = as.numeric(down.up[,2])
down.up[,3] = as.numeric(down.up[,3])
down.up = bed.window(down.up,half.win)

down.up.down = plot.df[plot.df$supercluster == 'down.up.down',12:14]
down.up.down[,2] = as.numeric(down.up.down[,2])
down.up.down[,3] = as.numeric(down.up.down[,3])
down.up.down = bed.window(down.up.down,half.win)


not.different = read.table('/home/FCAM/ssun/motifanalysis/meme_motif_enrichment2/csite_minsite/nondynamic_peaks.bed')
not.different = not.different[not.different$V1 != 'chrM',]
not.different = bed.window(not.different,half.win)


all.fimo = data.frame()
for(i in 1:ncol(fimo.scores.atac)) {
    factor = colnames(fimo.scores.atac)[i]

    mod.bigWig = paste0(factor,file.suffix)
    factor.name = factor
    print(factor.name)

    loaded.bw = load.bigWig(mod.bigWig)
    
    
    gradual.up.inten = bed.step.probeQuery.bigWig(loaded.bw, gradual.up,
                                           gap.value = 0, step = 10, as.matrix = TRUE)
    gradual.up.query.df = data.frame(cbind(colMeans(gradual.up.inten), factor.name,
                                    'gradual.up', seq(-half.win, (half.win-10), 10)), stringsAsFactors=F)
    colnames(gradual.up.query.df) = c('density', 'tf', 'category', 'range')
    
    
    up.down.inten = bed.step.probeQuery.bigWig(loaded.bw, up.down,
                                           gap.value = 0, step = 10, as.matrix = TRUE)
    up.down.query.df = data.frame(cbind(colMeans(up.down.inten), factor.name,
                                    'up.down', seq(-half.win, (half.win-10), 10)), stringsAsFactors=F)
    colnames(up.down.query.df) = c('density', 'tf', 'category', 'range')
    
    
    up.down.up.inten = bed.step.probeQuery.bigWig(loaded.bw, up.down.up,
                                           gap.value = 0, step = 10, as.matrix = TRUE)
    up.down.up.query.df = data.frame(cbind(colMeans(up.down.up.inten), factor.name,
                                    'up.down.up', seq(-half.win, (half.win-10), 10)), stringsAsFactors=F)
    colnames(up.down.up.query.df) = c('density', 'tf', 'category', 'range')
    
    
    down.up.inten = bed.step.probeQuery.bigWig(loaded.bw, down.up,
                                           gap.value = 0, step = 10, as.matrix = TRUE)
    down.up.query.df = data.frame(cbind(colMeans(down.up.inten), factor.name,
                                    'down.up', seq(-half.win, (half.win-10), 10)), stringsAsFactors=F)
    colnames(down.up.query.df) = c('density', 'tf', 'category', 'range')
    
    
    down.up.down.inten = bed.step.probeQuery.bigWig(loaded.bw, down.up.down,
                                           gap.value = 0, step = 10, as.matrix = TRUE)
    down.up.down.query.df = data.frame(cbind(colMeans(down.up.down.inten), factor.name,
                                    'down.up.down', seq(-half.win, (half.win-10), 10)), stringsAsFactors=F)
    colnames(down.up.down.query.df) = c('density', 'tf', 'category', 'range')
    
    
    
    ctrl.inten = bed.step.probeQuery.bigWig(loaded.bw, not.different,
                                            gap.value = 0, step = 10, as.matrix = TRUE)
    ctrl.query.df = data.frame(cbind(colMeans(ctrl.inten), factor.name,
                                     'unchanged', seq(-half.win, (half.win-10), 10)), stringsAsFactors=F)
    colnames(ctrl.query.df) = c('density', 'tf', 'category', 'range')
    
    
    
    
    tf.all = rbind(gradual.up.query.df, up.down.query.df, up.down.up.query.df, down.up.query.df, down.up.down.query.df, ctrl.query.df)

    all.fimo = rbind(all.fimo, tf.all)
}

all.fimo[,1] = as.numeric(all.fimo[,1])
all.fimo[,4] = as.numeric(all.fimo[,4])

plot.fimo.lattice(all.fimo, num.m = -500, num.p = 500,
                  col.lines = c('blue','cornflowerblue','tomato3','grey65','red','lightsalmon'))

motif_density_bycluster

13.1.1 Q: is a motif specific to clusters/group of clusters that share kinetic profiles?

CRE, NFYB, klf13, etv1, fosl2

13.1.2 Plot traces of isolated peaks

only for the five candidates motif: CRE, NFYB, klf13, etv1, fosl2

plot.df = data.frame()

fimo.scores.atac.subset<-fimo.scores.atac[, c(1,2,4,5,7)]

for(i in 1:ncol(fimo.scores.atac.subset)) {
      scores.temp = fimo.scores.atac.subset[!is.na(fimo.scores.atac.subset[,i]),]
      scores.temp = scores.temp[,-i]
      scores.temp = scores.temp[which(rowSums(is.na(scores.temp)) == ncol(scores.temp)),]
      temp = plot.df.atac[plot.df.atac$genes %in% rownames(scores.temp),]
      temp$family = colnames(fimo.scores.atac.subset)[i]
      plot.df = rbind(plot.df,temp)
}

plot.df$genes = as.factor(plot.df$genes)
cat.colours = plot.df[plot.df$merge == 'one_groupt0',]
cat.colours$genes <- as.factor(cat.colours$genes)
cat.colours$supercluster <- as.factor(cat.colours$supercluster)

cat.colours$colour[cat.colours$supercluster == 'up.down.up'] <- '#FF000008'
cat.colours$colour[cat.colours$supercluster == 'up.down'] <- '#FF000008'
cat.colours$colour[cat.colours$supercluster == 'gradual.up'] <- '#FF000008'
cat.colours$colour[cat.colours$supercluster == 'down.up.down'] <- '#0000FF08'
cat.colours$colour[cat.colours$supercluster == 'down.up'] <- '#0000FF08'

cat.colours$colour <- as.factor(cat.colours$colour)

cat.colours <- cat.colours[match(levels(plot.df$genes), cat.colours$genes), ]
pdf(file = 'five_isolated_families_dynamic_accessibility.pdf',width=14,height=14)

trellis.par.set(box.umbrella = list(lty = 1, col="black", lwd=1),
                box.rectangle = list( lwd=1.0, col="black", alpha = 1.0),
                plot.symbol = list(col="black", lwd=1.0, pch ='.'))

print(
    xyplot(value ~  sample.conditions | family, group = genes, data = plot.df, type = c('l'), #type = c('l','p'),
       scales=list(x=list(cex=1.5,relation = "free", rot = 45), y =list(cex=1.5, relation="free")),
       aspect=1.0,
       layout = c(3,2),
       between=list(y=0.5, x=0.5),
       
       main = list(label = 'Three Isolated Family Motifs in Dynamic Peaks', cex = 2.0),
       ylab = list(label = 'Normalized ATAC signal', cex =2.0),
       xlab = list(label = 'Time (minutes)', cex =2.0),
       par.settings = list(superpose.symbol = list(pch = c(16),col=c('grey20'), cex = 2.0),
                           strip.background=list(col="grey80"),
                           superpose.line = list(col = as.character(cat.colours$colour), lwd=c(1),lty = c(1))),
       panel = function(x, y, ...) {
           panel.xyplot(x, y, ...)
           #panel.bwplot(x, y, pch = '|', horizontal = FALSE, box.width = 15, do.out = FALSE)
           #panel.spline(x, y, col ='blue', lwd =2.0, ...) 
           
       })

)
dev.off()

traces